AN ADAPTIVE CHOICE OF THE SCALE PARAMETER FOR M-ESTIMATORS

Robert  Michael  Bell,  Ph.D. 

Stanford  University,  1980 

Let $x_1, \ldots, x_n$ be a random sample from a distribution symmetric
about the unknown location parameter $\theta$. A major class of robust
estimators of location is the class of M-estimators, each of which
corresponds to a function $\psi$ defined on the reals. To be scale
equivariant, these estimators require the use of a scale equivariant
function of the sample. Commonly, this scale parameter is chosen to be a
constant times the sample MAD (median absolute deviation from the median).

For a given function $\psi$, the variance of the corresponding
M-estimator varies considerably with the value of the scale parameter.
It is therefore proposed that the value which minimizes an estimate of
the asymptotic variance of the M-estimator be used as the scaling
factor. This adaptive method of scaling is shown to be asymptotically
optimal (under fairly general conditions), in the sense that the
resulting M-estimator has the smallest possible asymptotic variance
among all M-estimators based on $\psi$. In particular, when the underlying
distribution is normal, the adaptive estimator based on any reasonable
$\psi$ achieves full asymptotic efficiency, i.e., is asymptotically
equivalent to the sample mean.

The performance of the estimator for small samples is investigated
by Monte Carlo methods for several choices of $\psi$ using the
triefficiency criteria. A slight modification of the above estimator
compares favorably with Tukey's bisquare M-estimator for sample sizes
as small as 20.
AN ADAPTIVE CHOICE OF THE SCALE
PARAMETER  FOR  M-ESTIMATORS 


1.  Introduction 

Suppose that one observes a random sample $x_1, \ldots, x_n$ from a
distribution symmetric about the unknown location parameter $\theta$.
Except in a make-believe world, a desirable property for any estimator
of $\theta$ is that of robustness. We list two qualitative definitions
of  robustness  of  location  estimators.  The  first  is  that  an  estimator 
is  robust  if  it  possesses  high  efficiencies  for  a  wide  set  of  likely 
distributions.  The  second  requires  that  the  estimator  be  highly 
efficient  for  a  particular  model  and  yet  resistant  to  a  small  amount 
of  contamination.  The  two  concepts  are  generally  compatible,  and 
they  are  both  presented  to  give  alternative  motivations  for 
robustness.  Certainly,  many  other  definitions  can  also  be  given; 
e.g.,  see  Huber  (1972). 

A major class of robust estimators is the class of
M-estimators of location. In order to be scale equivariant, these
estimators require the use of a scale equivariant function of the
sample $s(x)$. Commonly $s(x)$ is a constant times the sample MAD
(median absolute deviation from the median). In this paper we propose
using the value of the scale parameter which minimizes the estimated
asymptotic variance of the M-estimator. Under fairly general
conditions this converges to the best possible value for the scale
parameter, resulting in the smallest possible asymptotic variance for
a fixed function $\psi$. In particular, full asymptotic efficiency is
attained for the normal distribution. In Section 10 we present small
sample (n = 20) Monte Carlo results which compare a slight modification
of the above proposal with Tukey's bisquare using the triefficiency
criteria. Even for this sample size the adaptive estimator
compares favorably with the bisquare.

While  the  analysis  in  this  paper  is  limited  to  estimation  of 
location,  M-estimation  may  also  be  used  on  the  more  important  problem 
of  multiple  linear  regression;  e.g.,  see  Andrews  (1974).  There 
appears  to  be  a  straightforward  extension  of  the  methods  described 
here  to  the  regression  problem. 

2.  Robust  Estimators  of  Location 

The  nonrobustness  of  the  sample  mean — and  other  least  squares 
procedures — is  well  documented.  The  mean  has  very  low  efficiency  for 
long-tailed  distributions  and  gives  far  too  much  weight  to  gross 
outliers.  Robust  alternatives  to  the  mean  are  usually  limited  to 
estimators  which  are  location  and  scale  equivariant.  In  accordance 
with Berk (1967), an estimator $T(x)$ is location and scale equivariant
if $T(ax + b\mathbf{1}) = aT(x) + b$, for all $a > 0$. We note that the term
invariant is also used by some authors for the above property.

The  most  commonly  used  nonadaptive  estimators  of  location 
fall  into  three  categories: 

(i) L-estimators. Linear combinations of order statistics. These
include trimmed means which in turn include the mean and median as
extremes.

(ii) M-estimators. Analogues of maximum likelihood estimators
(MLEs). The M-estimator $\hat\theta$ is a root of an equation of the form
$\sum \psi[(x_i - \hat\theta)/s(x)] = 0$, where $\psi$ is a skew-symmetric
function.

(iii) R-estimators. Midpoints of symmetric confidence intervals
based on linear rank tests. The best known of these is the
Hodges-Lehmann (1963) estimator, which turns out to be the median of the
pairwise averages of the observations.

Examples  of  each  of  these  types  of  estimators  are  contained 
in  the  books  by  Andrews,  et  al.  (1972)  and  Huber  (1977).  Johns 
(1979) has introduced P-estimators, which are robust analogues of
Pitman  estimators. 

Our further attention will be devoted almost exclusively to
M-estimators. The main reason for this is one given by Hampel
(1974), "Furthermore, the stress on M-estimates is not accidental;
neither L- nor R-estimates ... allow a proper rejection of outliers,
i.e., a rejection based on the distance from the bulk of the data."
This property of M-estimators is closely related to the fact that
their influence curves (defined in Section 4) are essentially
independent of the true distribution function $F$. We note that this
property is shared by P-estimators. Another reason for preferring
M-estimators is that because of their close association to MLEs, they
can be generalized to a large number of models.

Just as there are many definitions of robustness, there are
many criteria for judging robust estimators. The finite sample
variances (or equivalently efficiencies) for various symmetric
distributions are often used. The use of the variance is justified
by the argument that symmetry implies unbiasedness and good estimators
tend to be approximately normally distributed except when based
on small samples from long-tailed distributions. Monte Carlo techniques
are usually required to obtain these variances. The most
ambitious study of this kind is the Princeton Study by Andrews,
et al. (1972). Their study includes 65 estimators and 30 sampling
"situations" (distribution and sample size). Sample sizes ranged
from 5 to 40.

Except for the obvious choice of the normal, selecting from
among the many distributions is difficult. The results presented in
this paper are based on Tukey's (1979) concept of triefficiency.
Triefficiency is assessed over three diverse sampling situations for
which a robust estimator should do well. In each case a sample size of
n = 20 is used. The three situations are:

1. Standard normal.

2. One wild normal (1WN): 19 standard normals, and 1 N(0,100)
in each sample.

3. Slash: a standard normal random variable divided by an
independent uniform (0,1).
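
The three sampling situations above can be sketched in a few lines of Python. This is only an illustrative sketch, not the simulation code used in the study: the function name `draw_sample` is hypothetical, and N(0,100) is read here as variance 100 (standard deviation 10), which is an assumption about the notation.

```python
import random

def draw_sample(situation, n=20, rng=random):
    """Draw one sample of size n from one of the three triefficiency situations."""
    if situation == "normal":
        return [rng.gauss(0.0, 1.0) for _ in range(n)]
    if situation == "1wn":
        # n - 1 standard normals plus one "wild" N(0, 100) observation.
        # Assumption: 100 is the variance, so the standard deviation is 10.
        clean = [rng.gauss(0.0, 1.0) for _ in range(n - 1)]
        return clean + [rng.gauss(0.0, 10.0)]
    if situation == "slash":
        # Standard normal divided by an independent uniform(0, 1).
        # 1 - rng.random() lies in (0, 1], which avoids division by zero.
        return [rng.gauss(0.0, 1.0) / (1.0 - rng.random()) for _ in range(n)]
    raise ValueError(f"unknown situation: {situation!r}")

rng = random.Random(0)  # fixed seed so the draw is reproducible
samples = {name: draw_sample(name, rng=rng) for name in ("normal", "1wn", "slash")}
```

Passing an explicit `random.Random` instance keeps repeated Monte Carlo runs reproducible.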

The asymptotic variance is often another useful tool.
Comparison of the asymptotic variances of similar estimators may give
a reasonable idea of their relative performances for even fairly
small sample sizes. Care must be taken since the rates of convergence
of the finite sample variance to the asymptotic variance may
differ greatly for dissimilar estimators. For example, the convergence
for adaptive estimators, defined in the next section, will tend
to be slower than that for nonadaptive ones.

Another set of criteria, based on resistance to contamination,
is set forth by Huber (1977) and Hampel (1974). These include
breakdown point, maximum bias and variance as a function of the
amount of contamination, and gross error sensitivity.

3.  Adaptive  Estimators 

The  search  for  good  robust  estimators  of  location  often 
becomes  an  attempt  to  find  estimators  which  are  efficient  for  both 
the  normal  distribution  and  longer  tailed  alternatives  like  the 
slash.  Unfortunately,  tradeoffs  must  be  made.  The  mean,  which  is 
optimal  for  the  normal,  does  terribly  once  there  is  even  a  moderate 
amount  of  contamination.  Simple  estimators  which  do  very  well  for 
long-tailed  distributions  sacrifice  too  much  efficiency  for  the 
normal.  Inevitably  some  compromise  must  be  made  between  the  extreme 
objectives  if  we  limit  ourselves  to  the  nonadaptive  estimators 
described  in  the  last  section. 

The  motivation  for  adaptive  estimators  is  based  on  two  facts. 
First,  if  the  approximate  shape  of  the  sampling  distribution  is 
known,  it  is  not  too  difficult  to  find  an  estimator  which  will  do 
well  by  most  standards.  Second,  considerable  information  about  the 
shape  of  the  sampling  distribution  is  contained  in  most  samples  of  a 
moderate size. Adaptive estimators try to use the information in the
sample to select an estimator which is appropriate for the distribution
from which the sample came.

In  general,  adaptive  estimators  have  the  following  form.  A 
set  of  estimators,  finite  or  infinite,  is  chosen.  At  a  minimum  the 
set would include an estimator designed for short-tailed distributions
and another for long-tailed ones. Other estimators might be
included  to  handle  other  sized  tails  and  various  degrees  of 
peakedness.  Once  the  set  of  estimators  is  given,  one  needs  a  choice 
function  which  maps  the  set  of  possible  order  statistics  into  the  set 
of  estimators.  The  objective  of  the  choice  function  is  to  choose  the 
estimator  which,  in  some  sense,  best  suits  the  data. 

A good survey article on adaptive estimators is that of Hogg
(1974). He proposes a very simple adaptive estimator which chooses
from a small number of trimmed means (including a mean of extreme
order statistics) based on the value of the Q statistic. The Q
statistic is a measure of tail weight based on the outer order
statistics. Hogg also suggests that the degree of skewness of the
sample may be used to help choose the estimator. Other examples of
adaptive estimators are described in Andrews, et al. (1972, Sections
2B3 and 2E1).

An adaptive estimator of particular interest is due to
Jaeckel (1971). His estimator is an $\alpha$-trimmed mean where
$\alpha \in [\alpha_0, \alpha_1]$ is chosen adaptively. His method is to use
the value of $\alpha$ which minimizes the estimated asymptotic variance of
the trimmed mean as a function of $\alpha$. As $n \to \infty$ the trimming
proportion converges in probability to the optimal trimming proportion
in the given range, and thus the asymptotic variance of Jaeckel's
estimator is the same as if the optimal trimming proportion were known.
He also outlines how this method may be used to choose from among more
complex families of L-estimators which are indexed by two or more
parameters. Using the estimated variance to choose from a finite set of
estimators is also discussed by Switzer (1970).

A more ambitious class of adaptive estimators was introduced
by Stein (1956) who argued that estimators could be constructed which
would be fully efficient in terms of asymptotic variance for all
symmetric densities with finite Fisher information. His argument was
that as $n \to \infty$, the density and its derivative can be estimated
sufficiently well from the data to produce an estimator with variance
approaching the Cramér-Rao lower bound. Several authors have presented
estimators with this property. R-estimators were used by
Van Eeden (1970) and Beran (1974), while Stone (1975) presented an
estimator based on M-estimators. Sequences of L-estimators with
efficiencies converging to one have been given by Takeuchi (1971),
Johns (1974), and Sacks (1975). Beran (1977) has shown that a minimum
Hellinger distance estimator is also fully efficient.

While adaptive estimators are generally designed to have
excellent large sample properties, careful consideration must be
given to their small sample behavior. Estimators which try to
gather too much information from the data will tend to "overadapt."
In that case a very wrong inference about the true distribution may
be taken from some samples, resulting in a very bad estimate. Thus
the amount of adaptation must be carefully pegged to the sample
size, with more complex procedures being reserved for larger samples.

For small sample sizes on the order of n = 20, for which only
simple adaptive estimators are practical, adaptive estimators have
tended to do no better than good nonadaptive estimators. As n increases,
the adaptive estimators improve relative to nonadaptive ones. Thus some
improvement could be expected for moderate sized samples using some
of the fairly simple adaptive estimators. Less can be said about the
performance of fully efficient estimators for small and moderate
sized samples. The theoretical results for these estimators promise
nothing about their performance for finite n, even if n is very
large. However, simple estimators from the sequences of Takeuchi and
Johns performed well in the Princeton study. Encouraging results are
also obtained for n = 40 by Stone (1975).

In this paper we present an adaptive M-estimator for which
the function $\psi$ is fixed and the scale parameter $s(x)$ is chosen in the
same way as Jaeckel's trimming proportion. This adaptation is simple
in  the  sense  that  only  one  parameter  is  chosen  adaptively.  In  fact, 
no  more  parameters  than  usual  are  involved  in  the  estimator,  since 
s(x)  always  depends  on  the  data.  Even  so,  adaptively  choosing  the 
scale  parameter  can  lead  to  substantial  improvement.  This  idea  was 
mentioned  as  Proposal  3  of  Huber  (1964),  but  there  appears  to  have 
been  no  previous  attempt  to  pursue  it. 
4. M-Estimators

M-estimators of location were introduced in 1964 by Peter
Huber as analogues of maximum likelihood estimators. For a given
function $\rho$ the M-estimator $\hat\theta$ is defined as the value of
$\theta$ which minimizes $\sum \rho[(x_i - \theta)/s(x)]$. The sample scale
parameter $s(x)$ is necessary to make $\hat\theta$ scale equivariant.
Discussion of its importance is postponed until Section 5. It is
typically more convenient to define $\hat\theta$ as the solution to

(4.1)    $\sum_{i=1}^{n} \psi\left[\frac{x_i - \hat\theta}{s(x)}\right] = 0$

where $\psi = \rho'$, even though (4.1) may have multiple roots when $\rho$ is
not convex. If (4.1) does have multiple roots, then care must be taken
to find the correct root. Obviously $\hat\theta$ is not changed if $\psi$ is
multiplied by a constant.

The influence curve is a powerful tool in the analysis of
robust estimation. Suppose that an estimator $T$ is a functional of
the empirical distribution function. The influence curve of $T$
evaluated at the distribution $F$ is the function

(4.2)    $IC_{T,F}(x) = \lim_{\epsilon \downarrow 0} \frac{1}{\epsilon}\left[T\big((1-\epsilon)F + \epsilon\,\delta_x\big) - T(F)\right]$

where $\delta_x$ is the distribution function of a mass point at $x$.
Essentially $IC_{T,F}(x)$ measures the expected change in $T(F_n)$ from
adding a small amount of mass to $F$ at the point $x$. Hampel (1974)
provides an excellent discussion of the relationships between an
estimator's robustness properties and its influence curve.

In general, the influence curve depends on $T$ and $F$ through a
complex relationship. For the M-estimator $\hat\theta$ given by (4.1),
however,

(4.3)    $IC_{\hat\theta,F}(x) = \frac{s(F)\,\psi\left[\frac{x - \theta}{s(F)}\right]}{\int \psi'\left[\frac{t - \theta}{s(F)}\right] dF(t)}$

is proportional to $\psi$ at any distribution function $F$. This fact makes
the analysis of M-estimators (and P-estimators, which have a similar
expression for $IC_{T,F}(x)$) using the influence curve easier than the
analysis of L- or R-estimators. Since $IC_{\hat\theta,F}(x)$ is so closely
related to the function $\psi$, $\psi$ is often referred to as the influence
curve of $\hat\theta$. Furthermore, the shape of $\psi$ can be chosen to give
$\hat\theta$ desirable properties.

The influence curves of six M-estimators are shown in
Figure 1. The mean is not robust since its influence curve is
unbounded. While the median has a bounded influence curve, it is
inefficient at the normal distribution because of the large jump at
zero. For an estimator to have good efficiency at the normal, its
influence curve should be approximately linear near zero. Huber's $\psi$
combines the best features of the mean and median. Hampel's
"redescending" $\psi$ reflects his argument that extreme observations
should have very little or no influence. Two other smoothed
redescending $\psi$ functions are Andrews' sine and Tukey's bisquare.
Huber's $\psi$ is likely to be a good choice if one is interested in the
normal with only a small amount of contamination. A redescending $\psi$ is
better if one is also worried about the possibility of very heavy
tails.

[Figure 1]


At this point we list some assumptions which it will be
convenient to make about any subsequent functions $\psi$.

(A1) $\psi$ is skew symmetric; i.e., $\psi(-x) = -\psi(x)$.

(A2) $\psi(x) \ge 0$ for all $x \ge 0$.

(A3) $\psi(x)$ is continuous and piecewise differentiable.

(A4) $\psi'(0) = 1$.

(A5) $\psi''(0) = 0$.

(A6) $\psi'''(0) < 0$.

The sine and bisquare satisfy all the assumptions; but Huber's and
Hampel's $\psi$'s fail A6, an assumption which is convenient when the
scale is chosen adaptively. Assumption A4 is possible since $\psi$ may be
multiplied by a constant without changing $\hat\theta$. We list A5 separately
even though it is implied by A1 and A6.
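
As a worked check of A4 through A6, the block below differentiates the bisquare, the sine, and Huber's function at zero. The explicit formulas are assumptions: they are the commonly used forms, rescaled so that the cutoff sits at 1 (bisquare, Huber) or at $\pi$ (sine), and the text itself does not state them.

```latex
% Tukey's bisquare (assumed unit-cutoff form), for |u| <= 1, zero otherwise:
\psi_B(u) = u(1-u^2)^2 = u - 2u^3 + u^5, \qquad
\psi_B'(0) = 1, \quad \psi_B''(0) = 0, \quad \psi_B'''(0) = -12 < 0 .

% Andrews' sine (assumed form), for |u| <= \pi, zero otherwise:
\psi_S(u) = \sin u, \qquad
\psi_S'(0) = 1, \quad \psi_S''(0) = 0, \quad \psi_S'''(0) = -1 < 0 .

% Huber's psi (assumed unit-cutoff form) is linear near zero:
\psi_H(u) = \max\{-1, \min(1, u)\}, \qquad
\psi_H'(0) = 1, \quad \psi_H''(0) = 0, \quad \psi_H'''(0) = 0 ,
% so A6 fails, in agreement with the statement above.
```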


Solving (4.1) usually requires an iterative procedure which,
of course, must be stopped after a finite number of steps. For the
location problem, one or two iterations from a good starting point is
often adequate. The one step M-estimator (or $m_1$-estimator) with
starting point $M_0$ is

(4.4)    $\hat\theta^{(1)} = M_0 + s(x)\,\frac{\sum_{i=1}^{n} \psi\left[\frac{x_i - M_0}{s(x)}\right]}{\sum_{i=1}^{n} \psi'\left[\frac{x_i - M_0}{s(x)}\right]}$

The usual starting point is the median.
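
The one-step estimator (4.4) can be sketched as follows. This is a minimal illustration under stated assumptions, not the author's code: $\psi$ is taken to be the bisquare in the unit-cutoff form $u(1-u^2)^2$, the scale is $s(x) = c \cdot \mathrm{MAD}$, and the tuning constant `c = 6.0` is a hypothetical choice that does not come from the text.

```python
import statistics

def psi(u):
    """Tukey's bisquare, scaled so that psi'(0) = 1 (assumed form)."""
    return u * (1.0 - u * u) ** 2 if abs(u) <= 1.0 else 0.0

def psi_prime(u):
    """Derivative of the bisquare psi above."""
    return (1.0 - u * u) * (1.0 - 5.0 * u * u) if abs(u) <= 1.0 else 0.0

def one_step_estimate(x, c=6.0):
    """One-step M-estimate (4.4), started at the median, with s(x) = c * MAD."""
    m0 = statistics.median(x)
    mad = statistics.median([abs(xi - m0) for xi in x])
    s = c * mad  # c is a hypothetical tuning constant, not taken from the text
    if s <= 0.0:
        raise ValueError("MAD is zero; the scale is undefined")
    num = sum(psi((xi - m0) / s) for xi in x)
    den = sum(psi_prime((xi - m0) / s) for xi in x)
    if den <= 0.0:
        raise ValueError("sum of psi' is not positive; the step is unreliable")
    return m0 + s * num / den

x = [-2.0, -1.0, 0.0, 1.0, 2.0, 50.0]
print(one_step_estimate(x))  # the gross outlier at 50 receives zero weight
```

The guard on the denominator reflects the caution in the text about multiple roots: when the sum of $\psi'$ is small or negative, a single Newton-type step from the median is not trustworthy.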
5. The Scale Parameter

The scale parameter, the function of the data required to
make $\hat\theta$ scale equivariant, must satisfy
$s(ax + b\mathbf{1}) = a \cdot s(x)$, for $a > 0$.
While usually given little attention in previous studies of
M-estimators, the scale parameter is key to the performance of the
estimator. Simply stated, when $s(x)$ is small, $(x_i - \hat\theta)/s(x)$ is
more likely to be on the wings of $\psi$ where little or no weight is given
the observations. The result is a very outlier resistant estimator. When
$s(x)$ is large, $(x_i - \hat\theta)/s(x)$ tends to fall on the approximately
linear part of $\psi$ where the observations receive nearly equal weights.
Most commonly, $s(x)$ is a constant times the median absolute deviation
from the median (MAD) given by

(5.1)    $\mathrm{MAD} = \mathrm{median}\{|x_i - \mathrm{med}(x)|\}$

In this case the constant multiplying the MAD must compromise between
the objectives for light and heavy tails. The same compromise is
necessary for MLE type scale parameters solving
$\sum \chi[(x_i - \hat\theta)/s(x)] = 0$.

As $s(x) \to \infty$, the estimate $\hat\theta(x)$ converges to the sample mean.
Because the mean is optimal for the normal, it is convenient to have
it as a finite point in the set of estimators. This is accomplished
by letting

(5.2)    $\lambda(x) = 1/s(x)$,

so that $\hat\theta$ converges to the mean as $\lambda(x) \to 0$. We will refer to
$\lambda$ as the scale factor. For doing analysis, the resulting
parameterization appears to be more natural. The equations defining
$\hat\theta$ and $\hat\theta^{(1)}$ become

(5.3)    $\sum_{i=1}^{n} \psi[\lambda(x_i - \hat\theta)] = 0$

and

(5.4)    $\hat\theta^{(1)} = M_0 + \frac{\sum \psi[\lambda(x_i - M_0)]}{\lambda \sum \psi'[\lambda(x_i - M_0)]}$

for $\lambda = \lambda(x) > 0$. In each case, the limit as $\lambda \to 0$ is
$\bar{x}$. The estimators are equivariant if and only if
$\lambda(ax + b\mathbf{1}) = \lambda(x)/a$, for $a > 0$.
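
The limit of the one-step estimator as the scale factor tends to zero can be checked numerically. The sketch below evaluates (5.4) at a few small values of the scale factor; it assumes the unit-cutoff bisquare $u(1-u^2)^2$ for $\psi$ and the median as starting point, and the function name `one_step_lambda` is hypothetical.

```python
import statistics

def psi(u):
    """Tukey's bisquare, scaled so that psi'(0) = 1 (assumed form)."""
    return u * (1.0 - u * u) ** 2 if abs(u) <= 1.0 else 0.0

def psi_prime(u):
    """Derivative of the bisquare psi above."""
    return (1.0 - u * u) * (1.0 - 5.0 * u * u) if abs(u) <= 1.0 else 0.0

def one_step_lambda(x, lam):
    """One-step estimate (5.4) with scale factor lam = 1 / s(x)."""
    if lam == 0.0:
        return statistics.mean(x)  # the limiting case: the sample mean
    m0 = statistics.median(x)
    num = sum(psi(lam * (xi - m0)) for xi in x)
    den = lam * sum(psi_prime(lam * (xi - m0)) for xi in x)
    if den <= 0.0:
        raise ValueError("sum of psi' is not positive; the step is unreliable")
    return m0 + num / den

x = [0.3, -1.2, 0.8, 2.5, -0.4, 1.1, -0.9]
for lam in (1e-1, 1e-2, 1e-3):
    # The printed values approach statistics.mean(x) as lam shrinks.
    print(lam, one_step_lambda(x, lam), statistics.mean(x))
```

Treating `lam == 0.0` as a separate branch makes the mean an ordinary member of the family, which is the reason given in the text for working with the reciprocal of the scale.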


6. Asymptotic Variance of M-Estimators

Under fairly mild regularity conditions on the true
distribution, most robust estimators of location have a limiting
normal distribution of the form

(6.1)    $\sqrt{n}\,(T(x) - \theta) \longrightarrow N(0, V(T,F))$.

The variance of the limiting distribution, $V(T,F)$, is called the
asymptotic variance of $T$. Suppose that
$\lambda(x) = \lambda + O_p(n^{-1/2})$ as $n \to \infty$;
then for a given function $\psi$ the asymptotic variance of the
M-estimator $\hat\theta$ is

(6.2)    $V(\lambda; \psi, F) = \frac{E\,\psi^2[\lambda(X - \theta)]}{\lambda^2 \left[E\,\psi'[\lambda(X - \theta)]\right]^2}$.

The asymptotic variance of the one-step M-estimator $\hat\theta^{(1)}$ is also
$V(\lambda; \psi, F)$.
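
Formula (6.2) can be evaluated numerically for a given density. The sketch below does this for the standard normal with $\theta = 0$, using Simpson's rule; the bisquare form $u(1-u^2)^2$, the integration limits, and all function names are assumptions made for illustration.

```python
import math

def psi(u):
    """Tukey's bisquare, scaled so that psi'(0) = 1 (assumed form)."""
    return u * (1.0 - u * u) ** 2 if abs(u) <= 1.0 else 0.0

def psi_prime(u):
    """Derivative of the bisquare psi above."""
    return (1.0 - u * u) * (1.0 - 5.0 * u * u) if abs(u) <= 1.0 else 0.0

def normal_pdf(x):
    """Standard normal density."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def simpson(f, a, b, n=2000):
    """Composite Simpson's rule on [a, b] with an even number n of panels."""
    h = (b - a) / n
    total = f(a) + f(b)
    for k in range(1, n):
        total += (4.0 if k % 2 else 2.0) * f(a + k * h)
    return total * h / 3.0

def asymptotic_variance(lam, pdf=normal_pdf, upper=12.0):
    """V(lam; psi, F) from (6.2) for a density symmetric about theta = 0."""
    # psi and psi' vanish for |lam * x| > 1, so integrate only up to 1 / lam.
    b = min(upper, 1.0 / lam)
    e_psi_sq = 2.0 * simpson(lambda t: psi(lam * t) ** 2 * pdf(t), 0.0, b)
    e_psi_pr = 2.0 * simpson(lambda t: psi_prime(lam * t) * pdf(t), 0.0, b)
    return e_psi_sq / (lam * e_psi_pr) ** 2

for lam in (0.01, 0.2, 0.5):
    print(lam, asymptotic_variance(lam))
```

For the standard normal the Cramér-Rao lower bound is 1, so the printed values can be read directly as normalized variances.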

In Section 5 we gave a heuristic argument to suggest that for
fixed $\psi$, small values of $\lambda$ are good for short tailed distributions
while larger values of $\lambda$ are best for long tailed ones. Figure 2
shows the asymptotic variance of the bisquare as a function of $\lambda$ for
seven symmetric densities. The densities have been scaled to have
the same MAD, and the variances are normalized by dividing by the
Cramér-Rao lower bound, $\left[\int (f'(x))^2 / f(x)\,dx\right]^{-1}$. For all
except the most peaked densities (Laplace and Cauchy), there exists some
$\lambda$ such that the variance is 1.04 or less; the minimum for the Cauchy
is 1.14. However, the points where the minima are achieved are well
dispersed. For the normal the minimum occurs at $\lambda = 0$, and as the
weight of the tails increases the optimal value of $\lambda$ does also. If a
compromise value of $\lambda$ is chosen, then at least 5 to 10 percent
efficiency must be sacrificed for both the normal and the slash. We note
that for each $F$, $V(\lambda; \psi, F) \to \infty$ as $\lambda \to \infty$. This is
because the bisquare redescends to zero, allowing only a fraction of the
data to be used. As $\lambda \to \infty$ this fraction converges to zero.

Figure 3 compares asymptotic and finite sample (n = 20) variances
for three distributions. The contaminated normal for which the
asymptotic variances are given in Figures 2 and 3 is 95% N(0,1) and
5% N(0,100). The finite sample variance is for the one wild normal:
19 points N(0,1) and 1 point N(0,100). Except for the normal
distribution, the Cramér-Rao lower bound is generally not attainable for
finite sample sizes. For the slash, the sample size 20 variances are
up to 15 percent higher than the corresponding asymptotic variances.
Even so, the shapes of the curves are similar, illustrating that
comparable sacrifices must be made in choosing a single value of $\lambda$.
Figure 2.  Asymptotic Variance of Bisquare as a Function of $\lambda$ for
Several Densities (vertical axis: $V(\lambda)/\mathrm{CRLB}$)

Figure 3.  Comparison of Asymptotic and Finite Sample Variances of
Bisquare as a Function of $\lambda$ (vertical axis: $V(\lambda)/\mathrm{CRLB}$)

Because of the sharp peak in the Laplace density, the
bisquare does very poorly for all values of $\lambda$. Someone wishing to do
well for the Laplace would probably choose a function $\psi$ which is
monotone. Figure 4 shows the asymptotic variance of Huber's estimator
as a function of $\lambda$ for the same densities as in Figure 2. For
Huber's $\psi$, $\hat\theta$ converges to the sample median as
$\lambda \to \infty$. Thus $\hat\theta$ becomes efficient for the Laplace as
$\lambda \to \infty$. This improvement comes at the
expense of increased optimal variances for the contaminated normal
(1.11) and slash (1.12).

7. Adaptive Choice of $\lambda$

The graphs in the previous section show that for a given
function $\psi$ the choice of $\lambda$ makes a considerable difference in the
variance of the M-estimator. If we could always use $\lambda$ equal to or
nearly equal to the value which minimizes
$V(\lambda) = V(\lambda; \psi, F)$, say $\lambda_0$,
the resulting estimator would do exceptionally well. This suggests
trying to estimate $\lambda_0$ from the data. The simplest idea is to
estimate the function $V(\lambda)$ by a function $\hat V(\lambda)$ depending on
the data and use the value of $\lambda$ which minimizes $\hat V(\lambda)$, in
either (5.3) or (5.4). We denote this value of $\lambda$ by $\lambda^*(x)$, or
just $\lambda^*$, and the resulting estimators by $\hat\theta^*$ and
$\hat\theta^{*(1)}$. Although we shall find it necessary to
make a minor modification in this definition of $\lambda^*$, the flavor will
remain the same.

This procedure is promising for two reasons. First, under
most conditions of interest $\lambda^*$ will converge to $\lambda_0$ as
$n \to \infty$, and the asymptotic variance of the adaptive estimator will
be $V(\lambda_0)$ (see Theorems 2-7 in Section 9). If $\psi$ is the bisquare,
then the asymptotic efficiency will be 0.96 or higher for five of the
seven distributions in Figure 2. The asymptotic efficiency for the
normal will be one. Second, the fact that only one parameter is being
estimated from the data means that $\hat\theta^*$ should be stable for even
fairly small n. Working with the scale factor essentially allows us to
change the shape of the influence curve adaptively without adapting
$\psi$. The influence curve transforms smoothly from nearly linear
(small $\lambda$) to quickly redescending (large $\lambda$).

Figure 4.  Asymptotic Variance of Huber's Estimator as a Function of
$\lambda$ for Several Densities. Huber's function is $\psi(x) = -1$ for
$x < -1$, $\psi(x) = x$ for $-1 \le x \le 1$, and $\psi(x) = 1$ for $x > 1$.
Densities shown: normal, contaminated normal, logistic, slash, t(3),
Cauchy, Laplace.

For skew-symmetric $\psi$ the asymptotic variance $V(\lambda; \psi, F)$
depends on $X$ only through the distribution of $|X - \theta|$. If we assume
that $F$ is symmetric, it is reasonable to let the estimated variance
$\hat V(\lambda)$ depend only on

(7.1)    $y_i = |x_i - M|$  for $i = 1, 2, \ldots, n$

where $M$ is an estimate of $\theta$. So that it may be computed directly, we
have let $M$ equal the sample median. Other robust estimates could
also be used. A choice of particular interest is the M-estimate
solving $\sum \psi[\lambda(x_i - M)] = 0$. Unfortunately, since it depends on
$\lambda$, the calculation of $\lambda^*$ is made more difficult. Replacing
$(X - \theta)$ by $y_i$ and integrals by summations in (6.2), we get

(7.2)    $\hat V(\lambda) = \frac{n \sum_{i=1}^{n} \psi(\lambda y_i)^2}{\lambda^2 \left[\sum_{i=1}^{n} \psi'(\lambda y_i)\right]^2}$.
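
The estimated variance (7.2) is simple to compute from the absolute deviations about the median. The sketch below is illustrative only: it assumes the unit-cutoff bisquare $u(1-u^2)^2$ for $\psi$, and the function name `estimated_variance` is hypothetical. The value at zero is taken as the limit of (7.2), namely the mean squared deviation about the median.

```python
import math
import statistics

def psi(u):
    """Tukey's bisquare, scaled so that psi'(0) = 1 (assumed form)."""
    return u * (1.0 - u * u) ** 2 if abs(u) <= 1.0 else 0.0

def psi_prime(u):
    """Derivative of the bisquare psi above."""
    return (1.0 - u * u) * (1.0 - 5.0 * u * u) if abs(u) <= 1.0 else 0.0

def estimated_variance(x, lam):
    """Estimated asymptotic variance (7.2) based on y_i = |x_i - median|."""
    m = statistics.median(x)
    y = [abs(xi - m) for xi in x]
    n = len(y)
    if lam == 0.0:
        # Limit of (7.2) as lam -> 0: mean squared deviation about the median.
        return sum(v * v for v in y) / n
    num = n * sum(psi(lam * v) ** 2 for v in y)
    den = (lam * sum(psi_prime(lam * v) for v in y)) ** 2
    # When the sum of psi' vanishes the estimate is infinite.
    return num / den if den > 0.0 else math.inf

x = [0.3, -1.2, 0.8, 2.5, -0.4, 1.1, -0.9]
for lam in (0.0, 0.1, 0.2, 0.4):
    print(lam, estimated_variance(x, lam))
```

Plotting this function over a grid of scale factors for a single sample is one way to see how the shape of the estimated curve varies from sample to sample.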


Figures 5-7 show $\hat V(\lambda)$ for samples of size 20 from each of the
triefficiency distributions. In each case $\psi$ is the bisquare, and the
range of $\lambda$ (in terms of the sample MAD) is twice that of Figure 2.
Starting with the normal distribution in Figure 5, the first observation
to make is that $\hat V(\lambda)$ often has multiple relative minima. This is
caused by the extreme instability of $\hat V(\lambda)$ as $\lambda$ increases.
The six graphs fairly well represent the likely patterns of
$\hat V(\lambda)$. After having a relative minimum (usually at or near
$\lambda = 0$), the estimated variance may rocket to unreasonable heights
(in the hundreds or thousands) before plummeting, often below its
original minimum. The volatility of $\hat V(\lambda)$ is caused mainly by the
large relative changes which can occur in $\sum \psi'(\lambda y_i)$ as that
sum becomes small. Similar patterns appear in the graphs for the 1WN and
slash. It is clear that past the first relative minimum the estimated
variance in small samples cannot be trusted. Thus any subsequent
relative or absolute minima must be considered spurious.

In contrast, the behavior of $\hat V(\lambda)$ up to the point of the first
relative minimum is usually quite reasonable. This is because
$\hat V(\lambda)$ generally has a relative minimum before
$\sum \psi'(\lambda y_i)$ becomes small enough to make $\hat V(\lambda)$
unreliable. This is not to say that $\hat V(\lambda)$ will be an
accurate estimate of $V(\lambda)$ for n as small as 20. Fortunately, that is
unnecessary. What is necessary is for the shapes of $\hat V(\lambda)$ and
$V(\lambda)$ to be similar enough so that $\lambda^*$ adequately approximates
$\lambda_0$.

These considerations lead us to modify the original proposal
and to let $\lambda^*$ be the smallest $\lambda \ge 0$ such that $\hat V(\lambda)$
has a relative minimum at $\lambda^*$. We note that $\lambda^*$ must exist for
redescending $\psi$ since $\sum \psi'(\lambda y_i)$ eventually goes to zero, at
which point $\hat V(\lambda)$ goes to $+\infty$.
Examination of the graphs in Figures 5-7 shows that there is sometimes
a relative minimum at $\lambda = 0$. In those cases $\hat\theta^*$ is the sample
mean. In all other cases $\hat V(\lambda)$ has a relative maximum at
$\lambda = 0$. This fact is verified by expanding $\hat V(\lambda)$ in a Taylor
series about $\lambda = 0$. Using A1-A6 on equation (7.2) gives


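(7.3)  $\hat V(\lambda) = \dfrac{n \sum [\lambda y_i + \frac{1}{6}\lambda^3 y_i^3 \psi'''(0) + o(\lambda^3)]^2}{\lambda^2 [\sum (1 + \frac{1}{2}\lambda^2 y_i^2 \psi'''(0) + o(\lambda^2))]^2}$

       $= \dfrac{1}{n}\sum y_i^2 + \lambda^2 \left[ \dfrac{1}{3n}\sum y_i^4 - \left(\dfrac{1}{n}\sum y_i^2\right)^2 \right] \psi'''(0) + o(\lambda^2).$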


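Thus we have

(7.4)  $\hat V(0) = \dfrac{1}{n}\sum_{i=1}^n y_i^2 = \dfrac{1}{n}\sum_{i=1}^n (x_i - M)^2$,

(7.5)  $\hat V'(0) = 0$,

and

(7.6)  $\hat V''(0) = \dfrac{2}{3}\,\psi'''(0) \left\{ \dfrac{1}{n}\sum_{i=1}^n (x_i - M)^4 - 3\left[\dfrac{1}{n}\sum_{i=1}^n (x_i - M)^2\right]^2 \right\}.$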
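For readers who want the intermediate algebra behind (7.3) through (7.6), the following LaTeX sketch spells it out. The shorthand $m_k = \frac{1}{n}\sum y_i^k$ is introduced here for brevity and is not the original notation; the sketch assumes, as the expansion in (7.3) implies, that $\psi(0) = 0$, $\psi'(0) = 1$ and $\psi''(0) = 0$.

```latex
\begin{align*}
\sum_i \psi(\lambda y_i)^2
  &= n\lambda^2 m_2 + \tfrac{1}{3}\, n\lambda^4\, \psi'''(0)\, m_4 + o(\lambda^4), \\
\Bigl[\sum_i \psi'(\lambda y_i)\Bigr]^2
  &= n^2 \bigl[1 + \lambda^2\, \psi'''(0)\, m_2 + o(\lambda^2)\bigr], \\
\hat V(\lambda)
  &= \frac{n \sum_i \psi(\lambda y_i)^2}{\lambda^2 \bigl[\sum_i \psi'(\lambda y_i)\bigr]^2}
   = \bigl[m_2 + \tfrac{1}{3}\lambda^2 \psi'''(0)\, m_4\bigr]
     \bigl[1 - \lambda^2 \psi'''(0)\, m_2\bigr] + o(\lambda^2) \\
  &= m_2 + \lambda^2\, \psi'''(0) \bigl[\tfrac{1}{3} m_4 - m_2^2\bigr] + o(\lambda^2), \\
\hat V''(0)
  &= 2\,\psi'''(0)\bigl[\tfrac{1}{3} m_4 - m_2^2\bigr]
   = \tfrac{2}{3}\,\psi'''(0)\, m_2^2 \Bigl[\frac{m_4}{m_2^2} - 3\Bigr].
\end{align*}
```

Because $\psi'''(0) < 0$, the sign of $\hat V''(0)$ is opposite to that of $m_4/m_2^2 - 3$.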


Since $\hat V'(0) = 0$, there is a relative minimum at $\lambda = 0$
(except for a set of measure zero) if and only if $\hat V''(0) > 0$. Since
$\psi'''(0) < 0$ by A6, $\lambda^* = 0$ if and only if

       $K(x) = \dfrac{\frac{1}{n}\sum (x_i - M)^4}{[\frac{1}{n}\sum (x_i - M)^2]^2} - 3 < 0.$

The quantity $K(x)$ might be referred to as the sample kurtosis about the
median. In samples of size 20 from a normal distribution, $K(x)$ is
negative about 67 percent of the time. Thus two-thirds of the time
$\hat\theta^*$ is simply the sample mean. From Figures 6a and 7d it can be
seen that $\lambda^* = 0$ sometimes for samples from the 1WN (16 percent of
samples) and slash (4 percent) as well. While this is expected for the
1WN, it is at first alarming for the slash. Closer inspection shows that,
conditional on samples from the slash with $K(x) < 0$, the sample mean is
certainly worse than more resistant estimators, but not disastrously so.
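The criterion for $\lambda^* = 0$ is cheap to evaluate. A minimal Python sketch of the sample kurtosis about the median follows; the function names are hypothetical, and the sketch assumes the sample is not constant (otherwise the second moment about the median is zero).

```python
import statistics


def kurtosis_about_median(x):
    """K(x) = m4 / m2**2 - 3, with moments taken about the sample median."""
    m = statistics.median(x)
    n = len(x)
    m2 = sum((xi - m) ** 2 for xi in x) / n
    m4 = sum((xi - m) ** 4 for xi in x) / n
    return m4 / m2 ** 2 - 3.0


def lambda_star_is_zero(x):
    """True when K(x) < 0, i.e. when the adaptive estimate is the sample mean."""
    return kurtosis_about_median(x) < 0.0
```

For example, a two-point sample such as `[-1, 1] * 5` has $K(x) = -2$, so the sample mean would be used, while a sample with a few gross outliers typically has $K(x) > 0$.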

If $\hat V'(\lambda)$ is piecewise continuous, then

(7.7)  $\lambda^* = \inf\{\lambda > 0 : \hat V'(\lambda) > 0\}$

where

(7.8)  $\hat V'(\lambda) = \dfrac{2n}{\lambda^3 [\sum \psi'(\lambda y_i)]^2} \left[ \sum \lambda y_i\, \psi(\lambda y_i)\, \psi'(\lambda y_i) - \sum \psi(\lambda y_i)^2 - \dfrac{\sum \psi(\lambda y_i)^2\, \sum \lambda y_i\, \psi''(\lambda y_i)}{\sum \psi'(\lambda y_i)} \right].$

Of course, the positive factor before the brackets in (7.8) is unnecessary
in defining $\lambda^*$. The function $\hat V'(\lambda)$ depends on the
second derivative of $\psi$. Because of the discontinuity of $\psi''$ at
$x = 1$, $\hat V'(\lambda)$ has positive jumps at $\lambda = 1/y_i$, for
i = 1,2,...,n. These jumps often cause $\lambda^*$ to equal $1/y_i$ for one
of the larger values of $y_i$ even when the general trend of
$\hat V(\lambda)$ is downward in that neighborhood. The result is that
$\hat\theta^*$ does much worse for the long tailed slash than it would if
$\psi$ were smoother. For that reason, as well as for mathematical
considerations, we will limit further consideration to functions $\psi$
with several derivatives everywhere. There does not appear to be any
practical loss in doing so. The infinitely differentiable functions used
in the Monte Carlo study are


(7.9)  $\psi_p(x) = x \left[ 1 + \dfrac{x^2}{2p - 1} \right]^{-p}$  for $p > \tfrac{1}{2}$

and

(7.10)  $\psi_\infty(x) = \lim_{p \to \infty} \psi_p(x) = x e^{-x^2/2}.$


The maximum of $\psi_p$ always occurs at x = 1. As p increases, $\psi_p$
redescends more quickly with $\psi_\infty$ resembling the bisquare (see
Figure 8). Monte Carlo results are reported for p = 1.5, 2.0, 3.0,
and $\infty$.
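The family (7.9) and its limit (7.10) are straightforward to code. The short Python sketch below is illustrative only (the function names are hypothetical); it can be used to confirm numerically that the maximum sits at x = 1 and that large p approaches the limiting form.

```python
import math


def psi_p(x, p):
    """psi_p(x) = x * (1 + x^2 / (2p - 1))^(-p), defined for p > 1/2."""
    if p <= 0.5:
        raise ValueError("p must exceed 1/2")
    return x * (1.0 + x * x / (2.0 * p - 1.0)) ** (-p)


def psi_inf(x):
    """Limit of psi_p as p grows without bound: x * exp(-x^2 / 2)."""
    return x * math.exp(-0.5 * x * x)
```

Evaluating `psi_p` just below and just above x = 1 for any admissible p shows smaller values on both sides, in line with the statement that the maximum always occurs at x = 1.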

A simple algorithm is sufficient to compute $\lambda^*$:

1.  Check if $K(x) < 0$; if so, $\lambda^* = 0$ and $\hat\theta^* = \bar x$.

2.  If not, calculate $\hat V'(\lambda)$ for $\lambda = y_{[n]}^{-1}, y_{[n-1]}^{-1}, \ldots$ until it becomes positive.

3.  Do a binary search between the last two values of $\lambda$ until $\lambda^*$ is known within acceptable tolerance limits.

4.  Approximate $\lambda^*$ by linear interpolation.

Although steps 2 and 3 are not guaranteed to find the first time
$\hat V'(\lambda)$ becomes positive, the algorithm rarely misses
$\lambda^*$ by much. In fact, for large n the cost might be cut safely by
skipping some of the evaluations in Step 2. Because of the complexity of
$\hat V''(\lambda)$ relative to $\hat V'(\lambda)$, gradient methods do not
seem to offer much of an improvement over the above algorithm. The listing
of a FORTRAN program which finds $\lambda^*$ (with a modification discussed
in the next section) and $\hat\theta^{*(1)}$ appears in Appendix A.
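The four steps of the algorithm can be sketched in Python as follows. This is not the FORTRAN program of Appendix A but a minimal illustration under stated assumptions: $\psi$ is taken to be $\psi_\infty$ of (7.10), the grid in Step 2 is read as the reciprocals of the ordered absolute deviations from the median, and the fallback used when $\hat V'$ never becomes positive on the grid is a choice made here. All function names are hypothetical.

```python
import math
import statistics


def psi(t):
    return t * math.exp(-0.5 * t * t)


def dpsi(t):
    return (1.0 - t * t) * math.exp(-0.5 * t * t)


def d2psi(t):
    return (t ** 3 - 3.0 * t) * math.exp(-0.5 * t * t)


def v_hat_prime(lam, y):
    """Derivative (7.8) of the estimated variance for deviations y, lam > 0."""
    n = len(y)
    t = [lam * yi for yi in y]
    a = sum(psi(ti) ** 2 for ti in t)
    b = sum(dpsi(ti) for ti in t)
    s1 = sum(ti * psi(ti) * dpsi(ti) for ti in t)
    s2 = sum(ti * d2psi(ti) for ti in t)
    return 2.0 * n / (lam ** 3 * b ** 2) * (s1 - a - a * s2 / b)


def lambda_star(x, tol=1e-8):
    m = statistics.median(x)
    y = [xi - m for xi in x]
    n = len(y)
    m2 = sum(yi ** 2 for yi in y) / n
    m4 = sum(yi ** 4 for yi in y) / n
    # Step 1: negative kurtosis about the median means lambda* = 0.
    if m4 / m2 ** 2 - 3.0 < 0.0:
        return 0.0
    # Step 2: scan reciprocals of the absolute deviations, largest first.
    grid = sorted(1.0 / abs(yi) for yi in y if yi != 0.0)
    lo, f_lo = 0.0, None
    hi, f_hi = None, None
    for lam in grid:
        f = v_hat_prime(lam, y)
        if f > 0.0:
            hi, f_hi = lam, f
            break
        lo, f_lo = lam, f
    if hi is None:
        return grid[-1]  # assumption: fall back to the last grid point
    # Step 3: binary search on the sign of the derivative.
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        f = v_hat_prime(mid, y)
        if f > 0.0:
            hi, f_hi = mid, f
        else:
            lo, f_lo = mid, f
    # Step 4: linear interpolation between the final bracket ends.
    if f_lo is None or f_hi == f_lo:
        return 0.5 * (lo + hi)
    return lo - f_lo * (hi - lo) / (f_hi - f_lo)
```

As the text notes, the scan and bisection locate a sign change of $\hat V'$ but are not guaranteed to find the first one.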


8.  Modification of $\lambda^*$

The choice of $\lambda^*$ defined by (7.7) leads to an M-estimator which
performs poorly on small samples from short tailed distributions. For
n = 20 the absolute efficiencies of $\hat\theta^*$ on the normal and 1WN
are at most 0.85 and 0.80, respectively, while we would like each to be at
least 0.90. Conversely, the efficiency on the slash is excellent. This
evidence suggests that $\lambda^*$ tends to be too large on average. The
simplest attempt to correct this problem, replacing $\lambda^*$ with
$c\lambda^*$ for some c < 1, does not work. While decreasing c improves the
performance of $\hat\theta^*$ on the normal, the simultaneous loss of
efficiency on the slash is too great. Furthermore, no choice of c adds
more than 2 or 3 percent to the efficiency on the 1WN.

The graphs of $\hat V(\lambda)$ help to explain why the last idea is
unsuccessful. Most of the time $\lambda^*$ is fairly close to $\lambda_0$,
but occasionally $\hat V(\lambda)$ has a continual downward trend very far
past $\lambda_0$. For the bisquare this phenomenon is illustrated in
Figures 5f, 6f, 7c, and 7e. Although a local minimum sometimes occurs
before the general downward trend terminates, the presence of such a
minimum is very sensitive to small changes in $\hat V$. When no local
minimum occurs, $\lambda^*$ is too large to be satisfactorily corrected by
a multiplicative factor without overcompensating on other samples.

This problem is more pronounced when $\psi_{3.0}$ is used. Figure 9 shows
$\hat V(\lambda; \psi_{3.0})$ for the same 1WN samples used for Figure 6.
The horizontal axis has been rescaled to account for the different scaling
of $\psi_{3.0}$ and the bisquare. Because $\psi_{3.0}$ is smoother and
returns to zero faster than the bisquare, these graphs of $\hat V(\lambda)$
are much less volatile than those in Figure 6. In particular,
$\hat V(\lambda)$ is often very flat for moderate sized $\lambda$. While
$\lambda^* = 0.15/\mathrm{MAD}$ in 9d, a slight change in $\hat V$ could
move $\lambda^*$ to 1.0/MAD or more. It is easier to understand the
behavior of $\lambda^*$ by viewing graphs of $\hat V'(\lambda)$ since
$\lambda^*$ is the point at which the graph first breaks above the
horizontal axis. The lower function in each graph of Figure 10 is
$\hat V'(\lambda; \psi_{3.0})$ for these same 1WN samples. In Figure 10e
$\hat V'(\lambda)$ bends away from zero enough so that it doesn't become
positive until almost $2\lambda_0$. In Figure 10f $\hat V'(\lambda)$ stays
below zero until $\lambda$ is very much too large. A method is needed which
substantially reduces $\lambda^*$ in these extreme cases but has little
effect on $\lambda^*$ the rest of the time.

The best explanation for why $\hat V'(\lambda)$ sometimes lingers below
zero for so long seems to involve
$\frac{1}{n}\sum \lambda y_i\, \psi''(\lambda y_i)$. While each sample
average in $\hat V(\lambda)$ converges to the corresponding expectation in
$V(\lambda)$, the convergence of
$\frac{1}{n}\sum \lambda y_i\, \psi''(\lambda y_i)$ appears to be the
slowest. This is because the function $x\psi''(x)$, shown in Figure 11, has
a relatively large total variation in comparison to the other functions in
$\hat V'(\lambda)$. If a larger number of $y_i$ than expected are in the
range where $\lambda y\, \psi''(\lambda y)$ is maximized and fewer than
expected where it is minimized, then
$\frac{1}{n}\sum \lambda y_i\, \psi''(\lambda y_i)$ will overestimate
$E\, \lambda(X - \theta)\, \psi''[\lambda(X - \theta)]$. In this case
$\hat V'(\lambda)$ may remain negative well past $\lambda_0$. The estimator
is improved substantially by replacing
$\frac{1}{n}\sum \lambda y_i\, \psi''(\lambda y_i)$ in (7.8) with
$\frac{1}{n}\sum [\lambda y_i\, \psi''(\lambda y_i) - c_n (\lambda y_i)^2 \psi(\lambda y_i)^2]$,
where $c_n$ is a constant depending on the sample size n. The function
$x^2 \psi(x)^2$ is also shown in Figure 11.

Figure 10.  Graphs of $\hat V'(\lambda)$ and $\hat V'(\lambda) + g_n(\lambda)$ versus $\lambda \cdot$MAD, using $\psi_{3.0}$, for random samples of size 20 from the one wild normal.

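We redefine $\lambda^*$ by

(8.1)  $\lambda^* = \inf\{\lambda > 0 : \hat V'(\lambda) + g_n(\lambda) > 0\}$

where

(8.2)  $g_n(\lambda) = c_n\, \dfrac{2n \sum \psi(\lambda y_i)^2\, \sum (\lambda y_i)^2 \psi(\lambda y_i)^2}{\lambda^3 [\sum \psi'(\lambda y_i)]^3}.$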


Figure 10 shows some examples of the effect of $g_n(\lambda)$ on
$\lambda^*$ for samples from the 1WN. In each graph the lower function is
$\hat V'(\lambda)$ and the upper one is $\hat V'(\lambda) + g_n(\lambda)$
for $c_{20} = 1.0$. In the first four samples $\lambda^*$ given by (7.7) is
either less than $\lambda_0$ or very near it. In each of these cases, the
presence of $g_n(\lambda)$ has almost no effect on $\lambda^*$. However,
when $\lambda^*$ given by (7.7) is much larger than $\lambda_0$, the effect
of using (8.1) is substantial. In particular, it is by far the greatest in
situations like that of 10f. The value of $c_n$ can be used to tune the
adaptive estimator. Increasing $c_n$ improves the estimator for the normal
and 1WN at the cost of increased variance for the slash. For sample size
20 a value of $c_{20}$ of about 1.0 gives a reasonable balance between
short and long tailed distributions. If $c_n \to 0$ as $n \to \infty$, then
the effect of $g_n(\lambda)$ is asymptotically negligible. Also the
presence of $g_n(\lambda)$ does not affect whether or not $\lambda^* = 0$
since $g_n(\lambda) = O(\lambda^3)$ as $\lambda \to 0$.

One final modification has been made to the definition of $\lambda^*$.
Despite utilization of the function $g_n(\lambda)$, occasionally the
procedure breaks down to the extent that $\lambda^*$ is greater than the
reciprocal of the sample MAD. To avoid ever having more than half of the
data points fall on the redescending part of the influence curve, we make
the restriction $\lambda^* \le 1/\mathrm{MAD}$. This restriction affects
less than one percent of samples of size 20.
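The two modifications of this section, the penalty term (8.2) and the cap at the reciprocal of the MAD, can be sketched in Python as below. The sketch again assumes $\psi = \psi_\infty$ from (7.10) purely for illustration; the function names are hypothetical, and $c_n$ is passed in by the caller rather than chosen here.

```python
import math
import statistics


def psi(t):
    return t * math.exp(-0.5 * t * t)


def dpsi(t):
    return (1.0 - t * t) * math.exp(-0.5 * t * t)


def g_n(lam, y, c_n):
    """Penalty term (8.2) for deviations y = x - median and lam > 0."""
    n = len(y)
    t = [lam * yi for yi in y]
    a = sum(psi(ti) ** 2 for ti in t)             # sum of psi^2
    b = sum(dpsi(ti) for ti in t)                 # sum of psi'
    w = sum(ti ** 2 * psi(ti) ** 2 for ti in t)   # sum of (lam*y)^2 psi^2
    return c_n * 2.0 * n * a * w / (lam ** 3 * b ** 3)


def cap_at_inverse_mad(lam, x):
    """Enforce the restriction lambda* <= 1 / MAD."""
    m = statistics.median(x)
    mad = statistics.median([abs(xi - m) for xi in x])
    return min(lam, 1.0 / mad)
```

In use, the modified $\lambda^*$ would be the first point at which $\hat V'(\lambda) + g_n(\lambda)$ becomes positive, followed by the cap. Doubling a small $\lambda$ multiplies `g_n` by about eight, which is the cubic behaviour near zero that follows from (8.2).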


9.  Asymptotic Optimality of $\lambda^*$

In this section we show that for most common symmetric distributions,
$\lambda^*_n$ is asymptotically equivalent to the best scale factor
$\lambda_0$. Under certain restrictions (mainly on the smoothness of
$\psi$), Theorems 2 and 4 establish the consistency and asymptotic
normality of $\lambda^*_n$, when $\lambda_0$ is positive. The asymptotic
normality of $\hat\theta^*_n$, with best possible limiting variance using
$\psi$, is proved in Theorem 5. Finally when F is normal, Theorems 6 and 7
establish the asymptotic efficiency of $\hat\theta^*_n$. First, however, we
show that $\hat\theta^*$ and $\hat\theta^{*(1)}$ are location and scale
equivariant.


Theorem 1.  For $\lambda^*$ given by either (7.7) or (8.1) the M-estimator
$\hat\theta^*$ and one-step M-estimator $\hat\theta^{*(1)}$ are location
and scale equivariant.

Proof.  We only need to show that for a > 0,
$\lambda^*(a\mathbf{x} + b\mathbf{1}) = \lambda^*(\mathbf{x})/a$. We have
immediately from (7.8) that
$\hat V'(\lambda; a\mathbf{x} + b\mathbf{1}) = \hat V'(\lambda; a\mathbf{x}) = a^3\, \hat V'(a\lambda; \mathbf{x})$.
Thus the infimum of $\lambda$ such that $\hat V'(\lambda; \mathbf{x}) > 0$
is exactly a times that for $\hat V'(\lambda; a\mathbf{x} + b\mathbf{1})$,
and $\lambda^*(a\mathbf{x} + b\mathbf{1}) = \lambda^*(\mathbf{x})/a$ for
$\lambda^*$ defined by (7.7).

Since $g_n(\lambda; a\mathbf{x} + b\mathbf{1}) = a^3\, g_n(a\lambda; \mathbf{x})$,
it follows that the theorem is also true for $\lambda^*$ given by
(8.1).  []
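The scaling identity used in the proof of Theorem 1 can be checked numerically. The Python sketch below evaluates both sides of $\hat V'(\lambda; a\mathbf{x} + b\mathbf{1}) = a^3 \hat V'(a\lambda; \mathbf{x})$ using the expression (7.8); it assumes $\psi = \psi_\infty$ from (7.10) for concreteness, and the function names are hypothetical.

```python
import math
import statistics


def psi(t):
    return t * math.exp(-0.5 * t * t)


def dpsi(t):
    return (1.0 - t * t) * math.exp(-0.5 * t * t)


def d2psi(t):
    return (t ** 3 - 3.0 * t) * math.exp(-0.5 * t * t)


def v_hat_prime(lam, x):
    """Expression (7.8), computed from the raw sample x (centered at its median)."""
    m = statistics.median(x)
    t = [lam * (xi - m) for xi in x]
    n = len(t)
    a = sum(psi(ti) ** 2 for ti in t)
    b = sum(dpsi(ti) for ti in t)
    s1 = sum(ti * psi(ti) * dpsi(ti) for ti in t)
    s2 = sum(ti * d2psi(ti) for ti in t)
    return 2.0 * n / (lam ** 3 * b ** 2) * (s1 - a - a * s2 / b)


def equivariance_sides(x, a, b, lam):
    """Return (lhs, rhs) of V'(lam; a*x + b) = a^3 * V'(a*lam; x), for a > 0."""
    lhs = v_hat_prime(lam, [a * xi + b for xi in x])
    rhs = a ** 3 * v_hat_prime(a * lam, x)
    return lhs, rhs
```

The two sides agree up to rounding because the deviations from the median scale by a and are unaffected by the shift b, so $\lambda y_i$ for the transformed sample equals $(a\lambda) y_i$ for the original one.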

In Theorem 2 we establish conditions under which $\lambda^*_n$ converges
almost surely to the value of $\lambda$ which minimizes
$V(\lambda; \psi, F)$, subject to the requirement that the first relative
minimum of $V(\lambda)$ is also the absolute minimum. Fortunately, this
restriction seems to be unimportant in practice. Conditions (i), (ii), and
(iii) on $V'(\lambda)$ and $V''(\lambda)$ are necessary to eliminate
certain pathological cases from consideration. Condition (iv), which
requires a moment for the third derivative of $\psi$, is probably stronger
than necessary. However, the class of functions satisfying (iv) apparently
includes an ample choice of shapes for $\psi$ (e.g., given by (7.9)).
Furthermore, the results in the next section suggest that a very smooth
function $\psi$ is preferable when $\lambda$ is chosen adaptively.
Conditions (v) and (vi) appear to be satisfied easily.

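The property to follow is used in Theorems 2, 4, and 5. Let
$h(\lambda, q; x)$ be a real valued function for $\lambda \in R^+$ and
$q, x \in R$. For k = 0,1,... and $\Lambda \subset R$ define
$h(\lambda, q; x)$ to be in the set $\mathcal{H}_k(\Lambda)$ if there
exists some $\zeta > 0$ such that

(9.1)  $E \sup_{\lambda \in \Lambda,\ |q - \theta| < \zeta} \left| \dfrac{\partial^j}{\partial\lambda^j}\, h(\lambda, q; X) \right| < \infty$,  for $0 \le j \le k + 2$,

(9.2)  $E \sup_{\lambda \in \Lambda,\ |q - \theta| < \zeta} \left| \dfrac{\partial^{j+1}}{\partial\lambda^j\, \partial q}\, h(\lambda, q; X) \right| < \infty$,  for $0 \le j \le k + 1$,

and

(9.3)  $E \sup_{\lambda \in \Lambda,\ |q - \theta| < \zeta} \left| \dfrac{\partial^{j+2}}{\partial\lambda^j\, \partial q^2}\, h(\lambda, q; X) \right| < \infty$,  for $0 \le j \le k$.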

Theorem 2.  For symmetric F, let $\psi$ and $\lambda_0 > 0$ be such that
the conditions to follow hold. Then $\lambda^*_n$ defined in (7.7)
converges to $\lambda_0$ almost surely as $n \to \infty$.

(i)  $V'(\lambda_0; \psi, F) = 0$ and $V''(\lambda_0; \psi, F) > 0$.

(ii)  For all $\delta > 0$, $\sup_{\delta < \lambda < \lambda_0 - \delta} V'(\lambda; \psi, F) < 0$.

(iii)  If $E(X - \theta)^2 < \infty$, then $[E(X - \theta)^2]^2 < \tfrac{1}{3} E(X - \theta)^4 \le +\infty$.

(iv)  For all $\delta > 0$, $h_1(\lambda, q; x) = \psi[\lambda(x - q)]^2$
and $h_2(\lambda, q; x) = \psi'[\lambda(x - q)]$ are elements of
$\mathcal{H}_0[(\delta, \infty)]$.

(v)  There exists a constant $\tau > 0$, such that $\psi(x)^4$ and
$[\max(x\psi''(x), 0)]^2$ are less than or equal to
$\tau\, \psi(x)[\psi(x) - x\psi'(x)]$ for all x.

(vi)  $M_n$ converges almost surely to $\theta$.

Proof.  Since $\lambda^*$ is location invariant (i.e.,
$\lambda^*(\mathbf{x} + b\mathbf{1}) = \lambda^*(\mathbf{x})$), we may
assume that $\theta = 0$.

The proof consists of showing that for any $\epsilon > 0$, there exists a
$\delta > 0$ such that with probability one, each of the following events
occurs only finitely often:

(a)  $\lambda^*_n > \lambda_0 + \epsilon$.

(b)  $\delta < \lambda^*_n < \lambda_0 - \epsilon$.

(c)  $0 \le \lambda^*_n \le \delta$.

All subsequent statements of convergence of random variables or events
refer to almost sure convergence as $n \to \infty$.

We note that for $\lambda \le \lambda^*_n$,
$\sum \psi'(\lambda y_i) > 0$. Otherwise the continuity of $\psi'$ would
imply that $\hat V(\ell)$ was unbounded for $\ell < \lambda$ and that
$\lambda^*_n < \lambda$. Since $\sum \psi'(\lambda y_i) > 0$ for
$\lambda \le \lambda^*_n$, we can write

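(9.4)  $\lambda^*_n = \inf\{\lambda > 0 : W_n(\lambda) > 0\}$

where

(9.5)  $W_n(\lambda) = \left[ \dfrac{1}{n}\sum \psi'(\lambda y_i) \right]^3 \dfrac{\hat V'(\lambda)}{2\lambda}$

       $= \dfrac{1}{\lambda^4} \left\{ \dfrac{1}{n}\sum \psi'(\lambda y_i)\; \dfrac{1}{n}\sum \psi(\lambda y_i)[\lambda y_i\, \psi'(\lambda y_i) - \psi(\lambda y_i)] - \dfrac{1}{n}\sum \psi(\lambda y_i)^2\; \dfrac{1}{n}\sum \lambda y_i\, \psi''(\lambda y_i) \right\}.$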

Suppose that h is any of the functions which are summed in (9.5), i.e.,
$h(t) = \psi(t)^2$, $\psi(t)[t\psi'(t) - \psi(t)]$, etc. For fixed
$\lambda > 0$ and sufficiently small $\zeta > 0$, condition (iv) implies
that $E \sup_{|q| < \zeta} |\partial/\partial q\; h[\lambda(X - q)]| < \infty$.
By the strong law of large numbers we have that
$\lim_{n\to\infty} \frac{1}{n}\sum \sup_{|q| < \zeta} |\partial/\partial q\; h[\lambda(x_i - q)]| < \infty$
almost surely, which implies that
$\lim_{n\to\infty} \frac{1}{n}\sum \sup_{|q| < \zeta} |h[\lambda(x_i - q)] - h(\lambda x_i)| \to 0$
as $\zeta \to 0$. Since $M_n \to \theta = 0$ as $n \to \infty$, application
of the strong law of large numbers to $\frac{1}{n}\sum h(\lambda x_i)$
gives us that $\frac{1}{n}\sum h(\lambda y_i) \to E\, h(\lambda X)$. Thus
for any fixed $\lambda > 0$ such that $E\, \psi'(\lambda X) > 0$, we have

(9.6)  $W_n(\lambda) \to [E\, \psi'(\lambda X)]^3\, \dfrac{V'(\lambda)}{2\lambda}.$


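(a)  Let $\epsilon > 0$ be fixed. By (i) there exists some $\lambda$,
$\lambda_0 < \lambda \le \lambda_0 + \epsilon$, such that
$V'(\lambda) > 0$ and $E\, \psi'(\lambda X) > 0$. Thus $W_n(\lambda)$ is
eventually positive, and $\lambda^*_n$ is eventually less than
$\lambda_0 + \epsilon$.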


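(b)  Let $\delta > 0$ be fixed and define

       $\xi = \sup_{\delta \le \lambda \le \lambda_0 - \epsilon} \dfrac{[E\, \psi'(\lambda X)]^3\, V'(\lambda)}{2\lambda}.$

By (ii) $\xi < 0$, so that for any finite set $\{\lambda_i\}$ such that
$\delta = \lambda_1 < \lambda_2 < \ldots < \lambda_k = \lambda_0 - \epsilon$,
$\sup_{1 \le i \le k} W_n(\lambda_i) < \xi/2 < 0$ for sufficiently large
n. Condition (iv) and the SLLN imply there is a constant C such that
$|W_n'(\lambda)| \le C$ for all $\lambda \ge \delta$. Thus if
$\max_{1 \le i \le k-1} (\lambda_{i+1} - \lambda_i) < |\xi|/(2C)$, then
$\limsup_{n\to\infty} \sup_{1 \le i \le k} W_n(\lambda_i) < \xi/2$ implies
$\limsup_{n\to\infty} \sup_{\delta \le \lambda \le \lambda_0 - \epsilon} W_n(\lambda) < 0$,
which implies $\lambda^*_n$ is in the interval
$(\delta, \lambda_0 - \epsilon)$ only finitely often.

Note that (a) and (b) imply that there is at least one root of
$\hat V'(\lambda) = 0$ consistent for $\lambda_0$. We will proceed to show
that there are no roots in a small neighborhood of zero.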

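(c)  We note that $W_n(\lambda)$ may be written as

(9.7)  $W_n(\lambda) = \dfrac{1}{n}\sum_{i=1}^n a(\lambda; y_i)\; \dfrac{1}{n}\sum_{i=1}^n b(\lambda; y_i) - \dfrac{1}{n}\sum_{i=1}^n c(\lambda; y_i)\; \dfrac{1}{n}\sum_{i=1}^n d(\lambda; y_i)$,

where

(9.8)  $a(\lambda; y) = \psi(\lambda y)^2 / \lambda^2$,

(9.9)  $b(\lambda; y) = -\lambda y\, \psi''(\lambda y) / \lambda^2$,

(9.10)  $c(\lambda; y) = \psi'(\lambda y)$,

and

(9.11)  $d(\lambda; y) = \dfrac{\psi(\lambda y)}{\lambda} \cdot \dfrac{\psi(\lambda y) - \lambda y\, \psi'(\lambda y)}{\lambda^3}.$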


We need to show that there exists some $\delta > 0$ such that for n
sufficiently large, $\sup_{0 < \lambda \le \delta} W_n(\lambda) < 0$.

Let $\rho > 0$ be arbitrarily small. Since $\psi'(0) = 1$ and $\psi'$ is
continuous at zero and bounded below, there exists some $\delta > 0$ such
that $\lim_{n\to\infty} \frac{1}{n}\sum c(\lambda; y_i) \ge 1 - \rho$
uniformly for $\lambda \le \delta$. Thus since (v) implies
$d(\lambda; y) \ge 0$, we have that

(9.12)  $W_n(\lambda) \le \dfrac{1}{n}\sum_{i=1}^n a(\lambda; y_i)\; \dfrac{1}{n}\sum_{i=1}^n b(\lambda; y_i) - (1 - \rho)\, \dfrac{1}{n}\sum_{i=1}^n d(\lambda; y_i)$

for $\lambda \le \delta$ and n sufficiently large.

At this point we separate each of the summations in (9.12) into two parts
according to whether $|y_i| \le B$ (for a large positive number B). Along
with the condition on $\delta$ from the previous paragraph, we require
that $\delta \le \eta/B$ where $\eta > 0$ and B are constants to be
determined later. Thus for sufficiently large n
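(9.13)  $W_n(\lambda) \le \left[ \dfrac{1}{n}\sum_{|y_i| \le B} a(\lambda; y_i) + \dfrac{1}{n}\sum_{|y_i| > B} a(\lambda; y_i) \right] \left[ \dfrac{1}{n}\sum_{|y_i| \le B} b(\lambda; y_i) + \dfrac{1}{n}\sum_{|y_i| > B} b(\lambda; y_i) \right]$

        $- (1 - \rho) \left[ \dfrac{1}{n}\sum_{|y_i| \le B} d(\lambda; y_i) + \dfrac{1}{n}\sum_{|y_i| > B} d(\lambda; y_i) \right].$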

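Since $|y_i| \le B$ implies that $\lambda |y_i| \le \delta B \le \eta$,
Taylor expansions of the terms in (9.13) for $|y_i| \le B$ give

(9.14)  $W_n(\lambda) \le \left\{ [1 + O(\eta^2)]\, \dfrac{1}{n}\sum_{|y_i| \le B} y_i^2 + \dfrac{1}{n}\sum_{|y_i| > B} a(\lambda; y_i) \right\} \left\{ [-\psi'''(0) + O(\eta^2)]\, \dfrac{1}{n}\sum_{|y_i| \le B} y_i^2 + \dfrac{1}{n}\sum_{|y_i| > B} b(\lambda; y_i) \right\}$

        $- (1 - \rho) \left\{ [-\tfrac{1}{3}\psi'''(0) + O(\eta^2)]\, \dfrac{1}{n}\sum_{|y_i| \le B} y_i^4 + \dfrac{1}{n}\sum_{|y_i| > B} d(\lambda; y_i) \right\}$

for sufficiently large n. Remember that $\psi'''(0) < 0$.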

Finally we must consider separately the cases $EX^2 < \infty$ and
$EX^2 = +\infty$. First suppose $EX^2 < \infty$. Then for sufficiently
large n, we have


(9.15)  sup  W  (X)  < 


0<X<6 


n 


[1  +0(n2)  2  y2  +  —  2  y2 

y±<B  1  y±>B  1 


hr*  (o)  i 


[1  +0(n2)  ]-  2  y2  +  —  2  y2 

n  y.<B  1  n  y.>B  X 
i—  i 


-  (1-p)  [-  i  if;'"  (0)  +0(n2)]^  2  y4  . 


n  y.<B  1 

i— 
For fixed B, the same reasoning which leads to (9.6) implies that the
summations $\frac{1}{n}\sum_{|y_i| \le B} y_i^2$,
$\frac{1}{n}\sum_{|y_i| > B} y_i^2$, and
$\frac{1}{n}\sum_{|y_i| \le B} y_i^4$ converge to
$\int_{|x| \le B} x^2\, dF(x)$, $\int_{|x| > B} x^2\, dF(x)$, and
$\int_{|x| \le B} x^4\, dF(x)$ respectively as $n \to \infty$. As
$B \to \infty$, the three integrals converge to $EX^2$, 0, and $EX^4$.
Finally since $\tfrac{1}{3} EX^4 > (EX^2)^2$, choosing $\rho$ and $\eta$
sufficiently small, and B and n sufficiently large, implies that
$\sup_{0 < \lambda \le \delta} W_n(\lambda) < 0$.


Now suppose that $EX^2 = +\infty$. By reasoning similar to that above we
can choose $\rho$, $\eta$, and $B_0$ in such a way that if $B \ge B_0$,
then there exists some $\gamma > 0$ for which
$\sum_{|y_i| > B} a(\lambda; y_i) \le \gamma \sum_{|y_i| \le B} y_i^2$ and
$\sum_{|y_i| > B} b(\lambda; y_i) \le \gamma \sum_{|y_i| \le B} y_i^2$
implies $\sup_{0 < \lambda \le \delta} W_n(\lambda)$ is eventually
negative. Thus we only need to worry about those n and
$\lambda \le \delta$ such that one of the inequalities fails to hold.

Suppose, for example, that
$\sum_{|y_i| > B} a(\lambda; y_i) > \gamma \sum_{|y_i| \le B} y_i^2$. Let
the random variable $p_n$ equal the number of $|y_i| > B$. Then (v)
implies that

       $\dfrac{1}{n}\sum_{|y_i| > B} d(\lambda; y_i) \ge \dfrac{p_n}{n\tau}\, \dfrac{1}{p_n}\sum_{|y_i| > B} a(\lambda; y_i)^2 \ge \dfrac{p_n}{n\tau} \left[ \dfrac{1}{p_n}\sum_{|y_i| > B} a(\lambda; y_i) \right]^2 = \dfrac{n}{\tau p_n} \left[ \dfrac{1}{n}\sum_{|y_i| > B} a(\lambda; y_i) \right]^2.$

It is possible to choose B large enough so that
$P(|X| > B) \le \gamma/(6\tau)$. In that case $p_n/n$ is eventually less
than $\gamma/(3\tau)$ which implies that


y  >B  y 

l 


—  £  a(A;y.) 

n  y.>B 

i 


S  y2  ^  E  a(A;y  )  . 

y.<B  y  >B 

x—  J  i 


Similarly  we  get  that  ^  E  >g  d(A;y  )  >  ^  B  E  >B  b(A;y  ) 

J i  J±—  J1 

and  Ey  >B  d(A;yi}  -  y  n  Ey  >B  a(A;yi}  n  Ey  >B  b(A;yi} *  SinCe 

l  .1  l 
these inequalities hold for all $\lambda \in (0, \delta)$ (for large
enough n), we have $\sup_{0 < \lambda \le \delta} W_n(\lambda) > 0$ only
finitely often and thus $\lambda^*_n \le \delta$ only finitely often.  []

The desired consequence of Theorem 2 is that
$\sqrt{n}(\hat\theta^*_n - \theta)$ has a limiting normal distribution
with variance $V(\lambda_0; \psi, F)$, the best possible asymptotic
variance using $\psi$. Before proving that result, as Theorem 5, we need
to study the rate of convergence of $\lambda^*_n$ to $\lambda_0$. In
Theorem 4 we show, under slightly more restrictive conditions than in
Theorem 2, that $\sqrt{n}(\lambda^*_n - \lambda_0)$ has a limiting normal
distribution. With the use of this intermediate result, we are able to
obtain the asymptotic optimality of $\lambda^*_n$ as a scale factor when
$\lambda_0 > 0$. The most interesting case for which $\lambda_0 = 0$
occurs when F is the normal distribution function. Theorems 6 and 7 give
the asymptotic optimality of $\lambda^*_n$ and $\hat\theta^*_n$ when F is
normal.

Theorems 4-7 share several common elements. In each theorem we study the
limiting behavior of the root of an implicit equation in the presence of
a nuisance parameter which converges to a constant in probability. In
Theorems 5 and 7, where the implicit equation is the defining equation
for an M-estimate, the results are rather familiar. Because the defining
equation for $\lambda^*_n$ is more complex, Theorems 4 and 6 require a
more general framework. Theorem 3 is a central limit theorem in this more
general setup. No attempt has been made to find the most general
conditions under which the theorem is true. As a consequence, Theorems
4-7 may call for more derivatives of $\psi$ than are absolutely
necessary. However, for the same reasons that precede the statement of
Theorem 2, this limitation should have little significance.

Theorem 3 requires the following definitions. As before, suppose that
$x_1, \ldots, x_n$ is a random sample from F, and let $q_n$ be a real
valued function of $\mathbf{x}$ such that $q_n(\mathbf{x})$ converges in
probability to the constant $q_0$. Given the real valued functions
$a_j(t, q; x)$, for j = 1,...,4, let

(9.16)  $U_n(t) = \dfrac{1}{n}\sum_{i=1}^n a_1(t, q_n; x_i)\; \dfrac{1}{n}\sum_{i=1}^n a_2(t, q_n; x_i) + \dfrac{1}{n}\sum_{i=1}^n a_3(t, q_n; x_i)\; \dfrac{1}{n}\sum_{i=1}^n a_4(t, q_n; x_i)$

and

(9.17)  $U(t) = E a_1(t, q_0; X)\, E a_2(t, q_0; X) + E a_3(t, q_0; X)\, E a_4(t, q_0; X).$

Lemma 1 gives the limiting distribution of $T_n$, a consistent root of
the equation,

(9.18)  $U_n(T_n) = 0.$

Finally, define

(9.19)  $\sigma^2 = \alpha^T D \alpha$

where

(9.20)  $\alpha = (E a_2(t_0, q_0; X),\ E a_1(t_0, q_0; X),\ E a_4(t_0, q_0; X),\ E a_3(t_0, q_0; X))^T$

and D is the covariance matrix of $(a_1(t_0, q_0; X),\ a_2(t_0, q_0; X),\ a_3(t_0, q_0; X),\ a_4(t_0, q_0; X))$.

Theorem 3.  Under the conditions C1-C6, there exists a consistent
sequence of eventually unique roots of (9.18) converging to $t_0$ as
$n \to \infty$, and
$\sqrt{n}(T_n - t_0) \to N(0, \sigma^2/[U'(t_0)]^2)$.

(C1)  $U(t_0) = 0$.

(C2)  $U'(t_0) \ne 0$.

(C3)  $q_n = q_0 + O_p(n^{-1/2})$.

(C4)  $E\, \partial/\partial q\; a_j(t_0, q_0; X) = 0$, for j = 1,...,4.

(C5)  There exist neighborhoods T of $t_0$ and Q of $q_0$ such that
$E \sup_{T,Q} |\partial^2/\partial t^2\; a_j(t, q; X)|$,
$E \sup_{T,Q} |\partial^2/\partial t\, \partial q\; a_j(t, q; X)|$, and
$E \sup_{T,Q} |\partial^2/\partial q^2\; a_j(t, q; X)|$ are finite for
j = 1,...,4.

(C6)  The covariance matrix D is finite.


Proof.  The first step is to show that $\sqrt{n}\, U_n(t_0)$ has a
limiting normal distribution. Expanding $a_j(t_0, q_n; x_i)$ in a Taylor
series about $q_n = q_0$ gives

(9.21)  $\dfrac{1}{n}\sum a_j(t_0, q_n; x_i) = \dfrac{1}{n}\sum a_j(t_0, q_0; x_i) + (q_n - q_0)\, \dfrac{1}{n}\sum \dfrac{\partial}{\partial q} a_j(t_0, q_0; x_i) + \dfrac{1}{2}(q_n - q_0)^2\, \dfrac{1}{n}\sum \dfrac{\partial^2}{\partial q^2} a_j(t_0, r_n; x_i)$

for some $r_n$ such that $|r_n - q_0| \le |q_n - q_0|$. Since

       $\left| \dfrac{1}{n}\sum \dfrac{\partial^2}{\partial q^2} a_j(t_0, r_n; x_i) \right| \le \dfrac{1}{n}\sum \sup_{|r - q_0| \le |q_n - q_0|} \left| \dfrac{\partial^2}{\partial q^2} a_j(t_0, r; x_i) \right|$,

(C3), (C5), and the law of large numbers imply that
$\frac{1}{n}\sum \partial^2/\partial q^2\; a_j(t_0, r_n; x_i) = O_p(1)$ as
$n \to \infty$. Using C3 again implies that the last term of (9.21) is
$O_p(n^{-1}) = o_p(n^{-1/2})$. By C3, C5, and the law of large numbers
the second term is also $o_p(n^{-1/2})$. Thus the multivariate central
limit theorem implies that

(9.22)  $\sqrt{n} \left[ \dfrac{1}{n}\sum_{i=1}^n a_j(t_0, q_n; x_i) - E a_j(t_0, q_0; X) \right]_{j = 1, \ldots, 4} \to N_4(0, D).$

Condition C1 now implies that

(9.23)  $\sqrt{n}\, U_n(t_0) \to N(0, \sigma^2).$


We next consider the behavior of $U_n'(t)$ in T, the neighborhood of
$t_0$ mentioned in C5. We can write

(9.24)  $\dfrac{1}{n}\sum \dfrac{\partial}{\partial t} a_j(t, q_n; x_i) = \dfrac{1}{n}\sum \dfrac{\partial}{\partial t} a_j(t, q_0; x_i) + (q_n - q_0)\, \dfrac{1}{n}\sum \dfrac{\partial^2}{\partial t\, \partial q} a_j(t, r_n; x_i)$

for some new value of $r_n$ such that $|r_n - q_0| \le |q_n - q_0|$. For
$t \in T$, the second term on the right is $O_p(n^{-1/2})$, so that a
further Taylor expansion about $t = t_0$ gives

(9.25)  $\dfrac{1}{n}\sum \dfrac{\partial}{\partial t} a_j(t, q_n; x_i) = \dfrac{1}{n}\sum \dfrac{\partial}{\partial t} a_j(t_0, q_0; x_i) + (t - t_0)\, \dfrac{1}{n}\sum \dfrac{\partial^2}{\partial t^2} a_j(v_n, q_0; x_i) + O_p(n^{-1/2})$

for some $v_n \in T$. Near repetition of the steps following (9.21)
implies that

(9.26)  $\dfrac{1}{n}\sum \dfrac{\partial}{\partial t} a_j(t, q_n; x_i) = E\, \dfrac{\partial}{\partial t} a_j(t_0, q_0; X) + O(|t - t_0|) + o_p(1)$,

and consequently

(9.27)  $U_n'(t) = U'(t_0) + O(|t - t_0|) + o_p(1)$,

where $O(|t - t_0|)$ does not depend on n.

Thus for any $\epsilon > 0$, there exists some $\delta > 0$ such that
$|t - t_0| < \delta$ implies that

(9.28)  $P(|U_n'(t) - U'(t_0)| > \epsilon) \to 0$

as $n \to \infty$. By choosing $\epsilon < |U'(t_0)|$, we can find an
interval around $t_0$ such that the probability of multiple roots
converges to zero. Furthermore, (9.23) insures that the probability of
there being a root in the interval converges to one. Finally, (9.28)
implies that for any $\epsilon > 0$


(9.29) 


1  - 


|U’(tg)|  - 


<  - 


w 


<  1  + 


]u'(t0) 


as $n \to \infty$. Thus $\sqrt{n}(T_n - t_0)$ must have the same limiting
distribution as $-\sqrt{n}\, U_n(t_0)/U'(t_0)$.  []
Theorem 3 reduces the proofs of Theorems 4-7 to little more than
condition checking. We note the complementary relationship between t and
q. In each theorem the random variable of primary interest is t. In
Theorems 4 and 6, when t is the scale factor (or its square), q is a
location parameter, namely the median. In contrast, when t is a location
estimate, q is a function of the scale factor.

We define several functions and constants in preparation for
Theorem 5.  Let


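(9.30)  $Q_1(x) = \psi[\lambda_0(x - \theta)]^2$,

(9.31)  $Q_2(x) = -\lambda_0(x - \theta)\, \psi''[\lambda_0(x - \theta)]$,

(9.32)  $Q_3(x) = \psi'[\lambda_0(x - \theta)]$, and

(9.33)  $Q_4(x) = \psi[\lambda_0(x - \theta)]\, \{\lambda_0(x - \theta)\, \psi'[\lambda_0(x - \theta)] - \psi[\lambda_0(x - \theta)]\}.$

Then define

(9.34)  $\sigma^2 = \alpha^T D \alpha$

where

(9.35)  $\alpha = (E Q_2(X),\ E Q_1(X),\ E Q_4(X),\ E Q_3(X))^T$

and D is the covariance matrix of $Q(X) = (Q_1(X), \ldots, Q_4(X))$. Also
define

(9.36)  $\omega = \tfrac{1}{2}\, \lambda_0^3\, \{E\, \psi'[\lambda_0(X - \theta)]\}^3\, V''(\lambda_0).$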


Theorem 4.  Suppose that $\psi$, F, and $\lambda_0$ satisfy the conditions
below in addition to those of Theorem 2. Then for $\lambda^*_n$ given by
(7.7)

(9.37)  $\sqrt{n}(\lambda^*_n - \lambda_0) \to N(0, \sigma^2/\omega^2).$

(iv')  For some $\delta > 0$, $h_1(\lambda, q; x) = \psi[\lambda(x - q)]^2$,
$h_2(\lambda, q; x) = \psi'[\lambda(x - q)]$, and
$h_3(\lambda, q; x) = \psi''[\lambda(x - q)]$ are elements of
$\mathcal{H}_1[(\lambda_0 - \delta, \lambda_0 + \delta)]$.

(vi')  $M_n = \theta + O_p(n^{-1/2})$.

(vii)  $E\, Q_j(X)^2 < \infty$, for j = 1,...,4.

Proof.  Let $T_n = \lambda^*_n$, $t_0 = \lambda_0$, $q_n = M_n$,
$q_0 = \theta$, $a_1(t, q; x) = \psi[t(x - q)]^2$,
$a_2(t, q; x) = -t(x - q)\, \psi''[t(x - q)]$,
$a_3(t, q; x) = \psi'[t(x - q)]$, and
$a_4(t, q; x) = \psi[t(x - q)]\, \{t(x - q)\, \psi'[t(x - q)] - \psi[t(x - q)]\}$.
We note that
$U_n(t) = \tfrac{1}{2}\, t^3\, \{\frac{1}{n}\sum \psi'[t(x_i - M_n)]\}^3\, \hat V'(t)$,
so that indeed $U_n(T_n) = 0$. Also
$U'(t_0) = \tfrac{1}{2}\, t_0^3\, \{E\, \psi'[t_0(X - \theta)]\}^3\, V''(t_0)$,
so that $U'(\lambda_0) = \omega$. Thus if conditions C1-C6 hold,
application of Theorem 3 gives the desired result.

C1 and C2 each follow from condition (i) of Theorem 2. C3 is implied by
assumption (vi'). C4 follows from the fact that, for each j,
$\partial/\partial q\; a_j(t_0, q_0; X)$ has a symmetric distribution
centered at zero. Finally, C5 and C6 are both assumed in the hypothesis
of the theorem.  []
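The identity between $U_n(t)$ and $\hat V'(t)$ used in this proof can be verified directly from (7.8). The LaTeX sketch below does so; the shorthands A, B and $\bar a_j = \frac{1}{n}\sum_i a_j(t, M_n; x_i)$ are introduced here for brevity and are not the original notation.

```latex
\begin{align*}
A &= \sum_i \psi(t y_i)^2, \qquad B = \sum_i \psi'(t y_i), \qquad y_i = x_i - M_n, \\
\hat V'(t)
  &= \frac{2n}{t^3 B^2}\Bigl[\sum_i \psi(t y_i)\bigl\{t y_i\, \psi'(t y_i) - \psi(t y_i)\bigr\}
     - \frac{A}{B}\sum_i t y_i\, \psi''(t y_i)\Bigr], \\
\tfrac{1}{2}\, t^3 \Bigl(\frac{B}{n}\Bigr)^{3} \hat V'(t)
  &= \frac{B}{n}\cdot\frac{1}{n}\sum_i \psi(t y_i)\bigl\{t y_i\, \psi'(t y_i) - \psi(t y_i)\bigr\}
     + \frac{A}{n}\cdot\frac{1}{n}\sum_i \bigl\{-t y_i\, \psi''(t y_i)\bigr\} \\
  &= \bar a_3\, \bar a_4 + \bar a_1\, \bar a_2 = U_n(t).
\end{align*}
```

Since the factor multiplying $\hat V'(t)$ is positive whenever $B > 0$ and $t > 0$, the roots of $U_n$ and of $\hat V'$ coincide there.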

Theorem 5.  Under the assumptions of Theorem 4, there exists a consistent
sequence of roots $\hat\theta^*_n$ with limiting distribution

(9.38)  $\sqrt{n}(\hat\theta^*_n - \theta) \to N(0, V(\lambda_0; \psi, F)).$

Proof.  The proof follows the same line as the last one with
$T_n = \hat\theta^*_n$, $t_0 = \theta$, $q_n = \lambda^*_n$,
$q_0 = \lambda_0$, $a_1 = \psi[q(x - t)]$, $a_2 = 1$, and
$a_3 = a_4 = 0$. Conditions C1-C6 are all verified easily.


Theorem 6.  Let F be a normal distribution function. If $\psi$ has four
bounded derivatives on the real line and $M_n = \theta + O_p(n^{-1/2})$,
then $\lambda^*_n = O_p(n^{-1/4})$.


Proof.  The proof relies on Theorem 3 and some reasoning specific to this
case. Let $t_0 = 0$, $q_n = M_n$, and $q_0 = \theta$. For t > 0, let
$a_1 = \psi[t^{1/2}(x - q)]^2 / t$,
$a_2 = -t^{1/2}(x - q)\, \psi''[t^{1/2}(x - q)] / t$,
$a_3 = \psi'[t^{1/2}(x - q)]$, and
$a_4 = \psi[t^{1/2}(x - q)]\, \{t^{1/2}(x - q)\, \psi'[t^{1/2}(x - q)] - \psi[t^{1/2}(x - q)]\} / t^2$.
For t = 0, let $a_1(0, q; x) = (x - q)^2$,
$a_2(0, q; x) = -\psi'''(0)(x - q)^2$, $a_3(0, q; x) = 1$, and
$a_4(0, q; x) = \tfrac{1}{3}\psi'''(0)(x - q)^4$. We would like to set
$T_n = (\lambda^*_n)^2$, but it is not always true that
$U_n[(\lambda^*_n)^2] = 0$. That is true only when $U_n(0) \le 0$.
However, when $U_n(0) > 0$, $\lambda^*_n$ is zero and the conclusion of
Theorem 6 is trivial. Thus we only need to worry about the times that
$T_n = (\lambda^*_n)^2$. If conditions C1-C6 hold, then the theorem is
true.

Expanding U(t) in a Taylor series about t = 0 yields

(9.39)    $U(t) = -\psi'''(0)\{[E(X-\theta)^2]^2 - \tfrac{1}{3}E(X-\theta)^4\}$
          $\quad + t\,\{\tfrac{1}{18}[\psi'''(0)]^2[E(X-\theta)^6 - 3E(X-\theta)^2E(X-\theta)^4]$
          $\quad + \tfrac{1}{30}\psi^{(5)}(0)[E(X-\theta)^6 - 5E(X-\theta)^2E(X-\theta)^4]\} + O(t^2)$.

Since the kurtosis of the normal is zero, U(0) = 0, which implies C1.
Because the consistent root must be positive when $U_n(0) < 0$, it is
necessary that $U'(0) > 0$ (rather than just $\neq 0$). For the normal,
$E(X-\theta)^6 - 3E(X-\theta)^2E(X-\theta)^4 = 6[\mathrm{Var}(X)]^3$ and the $\psi^{(5)}(0)$ term vanishes, so
$U'(0) = \tfrac{1}{3}[\psi'''(0)]^2[\mathrm{Var}(X)]^3 > 0$ and C2 holds. The rest of the
conditions are verified easily by using the hypotheses of the
theorem. We note that the slightly weaker condition on the derivatives
of $\psi$ is made possible by the fact that all moments of the
normal distribution are finite. []
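
The normal-theory values used in this proof can be checked by direct arithmetic. The following Python sketch is an illustration only, not part of the original argument. It uses the even central moments of the normal, $E(X-\theta)^{2k} = (2k-1)!!\,\sigma^{2k}$, and it assumes the weights 1/18 and 1/30 in the linear term of (9.39), which are what a series expansion of $\psi$ with $\psi'(0) = 1$ gives. All function names are hypothetical.

```python
from fractions import Fraction

def normal_even_moment(k, var):
    """E(X - theta)^(2k) for a normal with variance var: (2k-1)!! * var^k."""
    double_factorial = 1
    for odd in range(1, 2 * k, 2):
        double_factorial *= odd
    return double_factorial * var ** k

def u_at_zero(psi3, var):
    """Constant term of (9.39): -psi'''(0) {[E X^2]^2 - E X^4 / 3}."""
    m2 = normal_even_moment(1, var)
    m4 = normal_even_moment(2, var)
    return -psi3 * (m2 ** 2 - Fraction(1, 3) * m4)

def u_prime_at_zero(psi3, psi5, var):
    """Coefficient of t in (9.39), with the assumed weights 1/18 and 1/30."""
    m2, m4, m6 = (normal_even_moment(k, var) for k in (1, 2, 3))
    return (Fraction(1, 18) * psi3 ** 2 * (m6 - 3 * m2 * m4)
            + Fraction(1, 30) * psi5 * (m6 - 5 * m2 * m4))

var = Fraction(2)
psi3, psi5 = Fraction(-3), Fraction(7)   # arbitrary illustrative derivatives
print(u_at_zero(psi3, var))              # zero, so C1 holds for the normal
print(u_prime_at_zero(psi3, psi5, var))  # equals psi3^2 * var^3 / 3
```

Exact rational arithmetic is used so that the cancellation in U(0) is visible as an exact zero rather than a rounding artifact.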

Theorem 7. Under the conditions of Theorem 6,

(9.40)    $\sqrt{n}\,(\hat{\theta}^*_n - \theta) \to N(0, \mathrm{Var}(X))$.

Proof. We let $T_n = \hat{\theta}^*_n$, $t_0 = \theta$, $q_n = (\hat{\lambda}^*_n)^2$, $q_0 = 0$, $a_1 = \psi[q^{1/2}(x-t)]$,
$a_2 = 1$, and $a_3 = a_4 = 0$. Again, C1-C6 are verified easily. []

10. Monte Carlo Results

In this section we present results demonstrating the
performance of the adaptive one-step M-estimator on the triefficiency
sampling situations described in Section 2. Numerous preliminary
Monte Carlo runs were done in the developmental stages to determine
the exact form of $\hat{\theta}^*_{(1)}$ and appropriate parameter values. The basic
estimator and the variations which are presented here were decided
upon before the final Monte Carlo study was done. Details of the
design of the study may be found in Appendix B.

Table 1 contains the main results of this section. The two
main estimators being compared are:

1. basic adaptive one-step M-estimator, $\hat{\theta}^*_{(1)}$, given by (5.4)
where $\lambda = \hat{\lambda}^*$ is defined by (8.1). The influence curve is $\psi_{3.0}(x)$,
and $c_{20} = 1.0$. Other functions $\psi$ and values of $c_{20}$ are used in
Tables 2 and 3 respectively.

2. nonadaptive one-step bisquare, given by (5.4) where
$\lambda = 1/(6.4 \cdot \mathrm{MAD})$. This estimator (for constants 7.4, 8.2, and 9.0)
is used in a Monte Carlo study of confidence interval robustness by
Gross (1976). The constant 6.4 has been chosen to produce fairly
equal performance (in terms of relative efficiency to the best known
estimators) for the normal and slash. The efficiency of the bisquare
on the 1WN is especially good. This estimator is also used in the
triefficiency study of P-estimators by Johns (1979). We shall see
that the particular choice of the constant is not especially important,
since the value of $c_{20}$ serves to tune $\hat{\theta}^*_{(1)}$ in practically the
same manner.
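
A minimal Python sketch of the nonadaptive one-step bisquare may help fix ideas. It assumes that (5.4) is the single Newton step from the median computed in the final loop of subroutine ESTIM in Appendix A, and that the bisquare is $\psi(z) = z(1 - z^2)^2$ for $|z| \leq 1$. The function names and the sample are hypothetical.

```python
from statistics import median

def bisquare(z):
    """Tukey's bisquare psi and its derivative, with support |z| <= 1."""
    if abs(z) >= 1.0:
        return 0.0, 0.0
    w = 1.0 - z * z
    return z * w * w, w * (1.0 - 5.0 * z * z)

def mad(x):
    """Median absolute deviation from the median (no consistency factor)."""
    m = median(x)
    return median(abs(xi - m) for xi in x)

def one_step(x, lam, psi=bisquare):
    """One Newton step from the median M for scale factor lam:
    M + sum(psi(z_i)) / (lam * sum(psi'(z_i))), with z_i = lam * (x_i - M)."""
    m = median(x)
    values = [psi(lam * (xi - m)) for xi in x]
    s_psi = sum(v[0] for v in values)
    s_dpsi = sum(v[1] for v in values)
    return m + s_psi / (lam * s_dpsi)

sample = [-1.3, -0.4, -0.1, 0.2, 0.5, 0.9, 1.1, 25.0]
lam = 1.0 / (6.4 * mad(sample))     # nonadaptive bisquare scaling from the text
print(one_step(sample, lam))        # the gross outlier 25.0 receives zero weight
```

The adaptive estimator differs only in how `lam` is chosen; the step itself is unchanged.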

TABLE 1

TRIEFFICIENCY OF BASIC ADAPTIVE ONE-STEP
M-ESTIMATOR FOR p = 3.0, $c_{20}$ = 1.0, n = 20

                                         Normal           1WN              Slash

Number of samples                        10,000           20,000           100,000

Variance of $\hat{\theta}^*_{(1)}$ x 20             1.070 (0.003)    1.197 (0.003)    6.172 (0.025)

Relative efficiency
to "best known"                          93.5 (0.3)       94.2             92.7

Relative efficiency
to bisquare                              105.0 (0.20)     98.9 (0.14)      103.5 (0.17)

Relative efficiency
to nonadaptive
one-step $\psi_{3.0}$,
$\lambda$ = 0.35/MAD                          103.1 (0.20)     98.3 (0.14)      101.8 (0.17)


The second line of Table 1 lists n = 20 times the variance of
the basic adaptive estimator. Standard errors of the estimated variances
appear in parentheses. In the next line those variances are
compared with the best known variances of location and scale equivariant
estimators. For the normal the best estimator is the mean, with
variance 1. For the other two situations, the true best estimator is
very difficult to calculate. For the 1WN we have used the variance
1.127, reported in the Princeton study for a Hampel, 25A. For the
slash, the best variance which we could find is 5.72 for the best
one-step bisquare in Figure 3. The relative efficiency of $\hat{\theta}^*_{(1)}$ is
in the neighborhood of 93-94 percent for each of the three diverse
situations.
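
The entries in the "best known" line of Table 1 follow directly from the variances quoted above. A short Python check, written only as an illustration (the function name is hypothetical):

```python
def relative_efficiency(reference_variance, estimator_variance):
    """Relative efficiency in percent: reference variance over estimator variance."""
    return 100.0 * reference_variance / estimator_variance

# (best known n*variance, n*variance of the adaptive estimator), from the text and Table 1
cases = {"normal": (1.0, 1.070), "1WN": (1.127, 1.197), "slash": (5.72, 6.172)}
for name, (best, adaptive) in cases.items():
    print(name, round(relative_efficiency(best, adaptive), 1))
# normal 93.5
# 1WN 94.2
# slash 92.7
```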

A more realistic test of a new estimator is comparison
against a single estimator with known good properties. Relative
efficiencies of $\hat{\theta}^*_{(1)}$ to the bisquare are given in line four of
Table 1. These show that $\hat{\theta}^*_{(1)}$ substantially outperforms the bisquare
on the normal (105.0) and slash (103.5) while sacrificing little on
the 1WN (98.9). This is typical of the performance of the adaptive
estimator. It does best for extreme distributions, where the loss
from compromising is the greatest. For moderate tailed distributions,
where there is little to be gained by adjusting $\lambda$, the adaptive
estimator does not do as well.

In the last line of Table 1, $\hat{\theta}^*_{(1)}$ is compared to a nonadaptive
one-step with the same function $\psi_{3.0}$, instead of $\psi_{bs}$. While the
adaptive estimator still seems to offer an improvement, the gain is
not nearly so striking. This is because the nonadaptive $\psi_{3.0}$ estimator
uniformly dominates the bisquare for these three situations. This fact
strongly suggests that for nonadaptive M-estimators an influence curve
which asymptotically returns to zero is preferable to one which is zero
for all but a finite interval of x values.

Tables 2, 3, and 4 give Monte Carlo results for modifications
of the basic adaptive estimator. This allows us to examine how certain
parameters affect $\hat{\theta}^*_{(1)}$. Each of these results is based on a 20
percent subset of the samples used to create Table 1. By using a
carefully chosen subset of samples, we have reduced the standard
errors of most of the entries well below what they otherwise would
be. Furthermore, the standard error of the difference between two
entries in the same column is usually smaller than either standard
error. Thus patterns appearing in the columns of the tables should
closely reflect the truth. For more details on these points, see
Appendix B.

In Table 2 we study the effect on the triefficiency of $\hat{\theta}^*_{(1)}$
of using various shaped functions $\psi$. We use the family $\psi_p$ in (7.9)
and (7.10) for p = 1.5, 2.0, 3.0, and $\infty$. Remember that as p
increases, $\psi_p(x)$ redescends more quickly from its maximum at x = 1.0.
Although the value of p has some effect on the performance of $\hat{\theta}^*_{(1)}$
(at least for the normal and slash), the adaptive estimator is not
overly sensitive to changes of p in this range. However, the
moderate values, 2.0 and 3.0, do appear to be slightly superior to
either p = 1.5 or p = $\infty$.

TABLE 2

RELATIVE EFFICIENCY OF $\hat{\theta}^*_{(1)}$ TO THE BISQUARE
FOR VARIOUS FUNCTIONS $\psi$; $c_{20}$ = 1.0

                          Normal          1WN             Slash

$\psi_p$, p = 1.5            105.7 (0.3)     98.5 (0.3)      102.6 (0.2)

$\psi_p$, p = 2.0            105.4 (0.2)     98.8 (0.2)      103.4 (0.2)

$\psi_p$, p = 3.0            105.0 (0.2)     98.9 (0.14)     103.5 (0.17)

$\psi_p$, p = $\infty$         104.4 (0.4)     98.8 (0.2)      102.2 (0.3)

$\psi_{bs}$                    104.7 (0.5)     100.0 (0.4)     87.9 (0.5)


The last line of Table 2 shows the performance of $\hat{\theta}^*_{(1)}$ using
$\psi_{bs}$ (rescaled to have a maximum at 1.0). The relative efficiencies
to the bisquare are good for the normal and 1WN but disastrous for
the slash. The apparent reason for this problem is that $\psi''_{bs}$ is
discontinuous at $\pm\sqrt{5}$ (for this scaling). This topic was discussed in
Section 7.

Table 3 demonstrates the effect, for p = 3.0 and n = 20, of $c_n$,
which appears in formula (8.2) for $g_n(\lambda)$. The relative efficiencies
in the first line (where $c_{20}$ = 0.0) correspond to defining $\hat{\lambda}^*$ without
the function $g_n(\lambda)$. As we stated in Section 8, the results are
unacceptable for the normal and 1WN.

TABLE 3

RELATIVE EFFICIENCY OF $\hat{\theta}^*_{(1)}$ TO THE BISQUARE
FOR VARIOUS VALUES OF $c_{20}$; p = 3.0

                          Normal          1WN             Slash

$c_{20}$ = 0.0               94.8 (1.0)      88.1 (0.7)      109.4 (0.4)

$c_{20}$ = 0.9               104.4 (0.2)     98.5 (0.15)     104.3 (0.2)

$c_{20}$ = 1.0               105.0 (0.2)     98.9 (0.14)     103.5 (0.17)

$c_{20}$ = 1.1               105.6 (0.3)     99.6 (0.2)      102.7 (0.2)

$c_{20}$ = 1.3               106.6 (0.3)     100.5 (0.3)     100.8 (0.2)

Values of $c_{20}$ in the interval (0.9, 1.3) all appear to be
reasonable choices. As $c_{20}$ increases in this range, the efficiency
of $\hat{\theta}^*_{(1)}$ improves for the normal and 1WN. At the same time the
efficiency for the slash decreases. We see that $c_n$ can be used to
tune $\hat{\theta}^*_{(1)}$ to allow for prior assessment of the likelihood of very
long tails. There is an essential relationship between $c_n$ and the
scaling of $\psi$. Different members of the family $\psi_p(x)$ gave similar
results for a single value of $c_{20}$ because each of these functions has
its maximum at x = 1.0. If $\psi_p$ is rescaled to have a maximum at t,
then $c_{20}$ should be divided by the corresponding power of t.

In Table 4 we present results of an ad hoc modification which
uniformly improves $\hat{\theta}^*_{(1)}$ for the three sampling situations. We noted
in Section 7 that $\hat{V}(\lambda)$ is least reliable as an estimate of $V(\lambda)$ when
$\tfrac{1}{n}\sum \psi'(\lambda y_i)$ is small. The modification places a lower bound on
$\tfrac{1}{n}\sum \psi'(\lambda y_i)$. If this bound is reached before $\hat{V}'(\lambda)$ becomes
positive, $\hat{\lambda}^*$ is set equal to the point at which this happens. For a
lower bound of 0.40, small improvements of 0.2 to 0.5 percent are
registered for each of the triefficiency situations. Increasing the
bound to 0.45 produces larger gains for the normal and 1WN, at the
expense of decreased efficiency for the slash. Even so, the estimated
efficiencies are all greater than or equal to those in Table 3 for
$c_{20}$ = 1.1. Lower bounds much larger than 0.45, however, seem to
cause too much harm on the slash to be an appropriate tuning method.
We note that the modification discussed in this paragraph adds only
one line of code to the computer program which finds $\hat{\lambda}^*$.

TABLE 4

RELATIVE EFFICIENCY OF $\hat{\theta}^*_{(1)}$ TO THE BISQUARE FOR VARIOUS
LOWER BOUNDS ON $\tfrac{1}{n}\sum \psi'(\lambda y_i)$; p = 3.0, $c_{20}$ = 1.0

                                        Normal          1WN             Slash

$\tfrac{1}{n}\sum \psi'(\lambda y_i) \geq 0.0$       105.0 (0.2)     98.9 (0.14)     103.5 (0.17)

$\tfrac{1}{n}\sum \psi'(\lambda y_i) \geq 0.40$      105.2 (0.2)     99.4 (0.2)      103.7 (0.2)

$\tfrac{1}{n}\sum \psi'(\lambda y_i) \geq 0.45$      105.6 (0.3)     100.0 (0.2)     103.1 (0.2)

To this point our Monte Carlo study has been limited to sample
size n = 20. By the adaptive nature of $\hat{\theta}^*_{(1)}$, however, the sample size
should be an important factor in its performance. We have argued
previously that as n increases, the performance of the adaptive
estimator should improve relative to that of the nonadaptive one. We
illustrate this quantitatively in Table 5 for n = 15 (about the
smallest n for which $\hat{\theta}^*_{(1)}$ might prudently be used) and n = 40. For
n = 15, the 1WN consists of 14 standard normals and one N(0,100) in
every sample. For n = 40, we used a "two wild normal" (2WN): 38 standard
normals and two N(0,100).

TABLE 5

RELATIVE EFFICIENCY OF $\hat{\theta}^*_{(1)}$ TO THE BISQUARE
FOR n = 15 AND n = 40; p = 3.0

                                                           Normal         Wild Normal    Slash

n = 15, $c_{15}$ = 1.15                                       105.2 (0.3)    97.7 (0.4)     100.8 (0.4)

n = 15, $c_{15}$ = 1.15, $\tfrac{1}{n}\sum \psi'(\lambda y_i) \geq 0.40$      105.5 (0.3)    98.3 (0.2)     101.0 (0.4)

n = 15, $c_{15}$ = 1.15, $\tfrac{1}{n}\sum \psi'(\lambda y_i) \geq 0.45$      106.1 (0.3)    99.1 (0.2)     100.4 (0.3)

n = 40, $c_{40}$ = 0.80 (2WN)                                 106.6 (0.3)    99.8 (0.3)     106.1 (0.6)

Even for n = 15 (and $c_{15}$ = 1.15), $\hat{\theta}^*_{(1)}$ performs comparably to
the bisquare. Since it is substantially better for the normal and
substantially worse for the 1WN, there is no clear-cut winner.
Placing a lower bound on $\tfrac{1}{n}\sum \psi'(\lambda y_i)$ produces incremental changes
similar to those illustrated in Table 4. Even with those improvements
the advantage of $\hat{\theta}^*_{(1)}$ over the bisquare is probably not large
enough to warrant its use.

By sample size n = 40, the advantage of the adaptive estimator
is certainly significant. While it is comparable to the bisquare
for the 2WN, $\hat{\theta}^*_{(1)}$ beats the bisquare by at least 6 percent for both
the normal and slash. Thus $\hat{\theta}^*_{(1)}$ quickly approaches the ultimate
advantage which is achieved as $n \to \infty$.

11.  Conclusions 

Perhaps the first conclusion of any robustness study should
be this advice: it is not so important which robust technique you use,
as long as you do use one. The gains from using a new and improved
robust estimator are small compared to the risks of not using one
at all. On the other hand, the search for improvements over existing
methods can lead to very important insights.

The  scale  factor  (or  its  inverse,  the  scale  parameter)  is  an 
important  determinant  of  the  efficiency  of  an  M-estimator  and  thus 
deserves  much  more  attention  than  it  has  received  in  the  past. 

Accurate  tuning  of  the  scale  factor  to  the  underlying  distribution 
function  can  lead  to  very  substantial  gains  in  asymptotic  efficiencies 
in  comparison  to  those  of  nonadaptive  M-estimators.  By  minimizing  the 
estimated  variance  as  a  function  of  the  scale  factor,  one  achieves  the 
best  possible  asymptotic  variance  using  any  scale  factor.  Furthermore, 
since  only  one  parameter  is  chosen  adaptively,  surprising  gains  are 
possible  for  samples  as  small  as  15  or  20.  Perhaps  this  fact  has 
gone  unnoticed  because  of  the  manipulations  in  Section  8  which  are 
required  to  make  the  estimator  acceptable. 

Adaptive choice of the scale factor also offers advantages in
the reporting stages of an analysis. If F is normal, there is a good
chance that $\hat{\lambda}^*$ will equal 0, and the mean can be used. While
analysts using other robust methods might also realize that the data
support the normal hypothesis, this assessment is likely to depend
on a subjective look at the data. On the other hand, as $\hat{\lambda}^*$ increases,
one has more evidence that the normal hypothesis is untenable.
Mallows (1979) advocates reporting the weights that each observation
has on the analysis. For M-estimation of location, these are
$w_i = \psi[\lambda(x_i - \hat{\theta})]/\lambda(x_i - \hat{\theta})$. When $\lambda$ is chosen adaptively, the set of
weights is much more likely to be appropriate.

The success of the adaptive M-estimator of location suggests
that the same principle has merit for other robust estimation
problems. The most straightforward generalization is to robust linear
regression. Residuals from an initial resistant fit would replace
$x_i - M_n$ in the definition of $\hat{V}(\lambda)$. Issues like the effect of p (the
number of independent variables) on the small sample behavior and the
sensitivity of the estimator to the initial fit will require more
investigation. Other potential uses for the principle include robust
estimation in general one parameter models (Huber, 1977, p. 32) and
in generalized linear models (Pregibon, 1979). In each case the
weights of the observations can depend on a parameter which is chosen
adaptively.

APPENDIX

A. Program to Compute $\hat{\lambda}^*$ and $\hat{\theta}^*_{(1)}$

FMED       sample median
FMAD       sample MAD
PARAM(1)   $c_n$
PARAM(2)   tolerance for binary search = PARAM(2)/FMAD;
           PARAM(2) = 0.06 is sufficiently small
PARAM(3)   p for $\psi_p$; if PARAM(3) $\leq$ 0.5, p = $\infty$
PARAM(4)   minimum value for $\tfrac{1}{n}\sum \psi'(\lambda y_i)$; see Table 4, Section 10
IODBUG     if greater than zero, debugging information is written
           on unit IODBUG at each step of search for $\hat{\lambda}^*$

      SUBROUTINE ESTIM (T2,FLMBDA,IODBUG)
      COMMON /PARAM/ PARAM(4)
      COMMON /X/ X(100), Y(100), FMED, FMAD, N
C     A = LOWER BOUND FOR ROOT OF (VHAT' + G)
      A = 0.001/FMAD
      EPS = PARAM(2)/FMAD
      NN = N + 1
C     FA = VHAT(A)
C     GA = VHAT'(A) + G(A)
      CALL FG(A,FA,GA)
      IF (IODBUG.LE.0) GO TO 155
      KK = 155
      WRITE (IODBUG,150) NN,A,FA,GA,KK
  150 FORMAT (I5,F8.4,2F9.3,I6)
C     CHECK FOR ROOT AT ZERO
  155 IF (GA.GE.0 .AND. Y(N).LT.100*FMAD) GO TO 210
      NN = N
  160 B = 1.0/Y(NN)
      CALL FG(B,FB,GB)
      IF (IODBUG.LE.0) GO TO 165
      KK = 165
      WRITE (IODBUG,150) NN,B,FB,GB,KK
C     CHECK FOR ROOT BETWEEN A AND B
  165 IF (GB.GE.0) GO TO 180
C
C     GB IS STILL NEGATIVE
      A = B
      GA = GB
      NN = NN - 1
      IF (2*NN.GT.N) GO TO 160
      FLMBDA = 1.0/Y(NN+1)
      GO TO 200
C
C     ROOT BETWEEN A AND B
C     DO BINARY SEARCH UNTIL B-A < EPS
  180 DIFF = B - A
      IF (DIFF.LT.EPS) GO TO 195
      AB = (A+B)/2.0
      CALL FG(AB,FAB,GAB)
      IF (IODBUG.LE.0) GO TO 185
      KK = 185
      WRITE (IODBUG,150) NN,AB,FAB,GAB,KK
  185 IF (GAB.GE.0) GO TO 190
C
C     ROOT BETWEEN AB AND B
      A = AB
      GA = GAB
      GO TO 180
C
C     ROOT BETWEEN A AND AB
  190 B = AB
      GB = GAB
      GO TO 180
C
C     ROOT BETWEEN A AND B, B-A < EPS
  195 FLMBDA = B - DIFF*GB/(GB-GA)
  200 SUM1 = 0.0
      SUM2 = 0.0
      DO 205 J=1,N
      Z = ( X(J) - FMED ) * FLMBDA
      CALL PS (Z,PSI,PSI1,ZPSI2)
      SUM1 = SUM1 + PSI
      SUM2 = SUM2 + PSI1
  205 CONTINUE
      T2 = FMED + SUM1/(SUM2*FLMBDA)
      RETURN
C
C     ROOT AT ZERO
C     T2 = SAMPLE MEAN
  210 FLMBDA = 0
      T2 = 0
      DO 215 J=1,N
      T2 = T2 + X(J)
  215 CONTINUE
      T2 = T2/N
      RETURN
      END


      SUBROUTINE FG(A,FA,GA)
      DIMENSION SUM(5)
      COMMON /X/ X(100),Y(100),FMED,FMAD,N
      COMMON /PARAM/ PARAM(4)
      CN = PARAM(1)
      DO 120 M=2,5
      SUM(M) = 0.
  120 CONTINUE
      DO 140 J=1,N
      Z = (X(J) - FMED) * A
      CALL PS (Z,PSI,PSI1,ZPSI2)
      PSISQ = PSI*PSI
      ZPPSI1 = Z*PSI*PSI1
      SUM(2) = SUM(2) + PSISQ
      SUM(3) = SUM(3) + PSI1
      SUM(4) = SUM(4) + ZPSI2 - Z*Z*PSISQ*CN
      SUM(5) = SUM(5) + ZPPSI1
  140 CONTINUE
      FA = N*SUM(2) / (A*A*SUM(3)*SUM(3))
      FACTOR = 2*N / (A*A*A*SUM(3)*SUM(3))
      GA = FACTOR * ( SUM(5) - SUM(2) - SUM(4)*SUM(2)/SUM(3) )
      IF ( SUM(3).LT.N*PARAM(4) ) GA = 10000.
      RETURN
      END

      SUBROUTINE PS (Z,PSI,PSI1,ZPSI2)
      COMMON /PARAM/ PARAM(4)
      P = PARAM(3)
      ZSQ = Z*Z
      IF (P.GT.0.5) GO TO 5
C     P = INFINITY
      IF (ABS(Z).GT.8.) GO TO 10
      EX = EXP(-0.5*ZSQ)
      PSI = Z*EX
      PSI1 = (1.-ZSQ)*EX
      ZPSI2 = ZSQ*(ZSQ-3.0)*EX
      RETURN
C     P > 0.5
    5 CONTINUE
      IF (ABS(Z).GT.1000) GO TO 10
      P2 = 2.0*P - 1.0
      DZSQ1 = 1.0 + ZSQ/P2
      DZSQ1P = DZSQ1**P
      PSI = Z/DZSQ1P
      PSI1 = (1.0-ZSQ) / (DZSQ1P*DZSQ1)
      ZPSI2 = -2.0*ZSQ*P*(3.0-ZSQ) / (P2*DZSQ1P*DZSQ1*DZSQ1)
      RETURN
   10 PSI = 0
      PSI1 = 0
      ZPSI2 = 0
      RETURN
      END
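
As a cross-check on the formulas coded in FG and PS, the following Python sketch recomputes the estimated variance and its derivative in the scale factor and compares the derivative with a finite difference. It is an illustration under stated assumptions: the $g_n$ term is dropped (that is, $c_n = 0$), the lower bound of PARAM(4) is ignored, $\psi_p(z) = z/(1 + z^2/(2p-1))^p$ is read off from subroutine PS, and the names `psi_p` and `vhat_and_slope` are hypothetical.

```python
def psi_p(z, p):
    """psi_p(z) = z / (1 + z^2/(2p-1))^p, its derivative, and z * psi_p''(z)."""
    d = 1.0 + z * z / (2.0 * p - 1.0)
    dp = d ** p
    psi = z / dp
    dpsi = (1.0 - z * z) / (dp * d)
    z_d2psi = -2.0 * z * z * p * (3.0 - z * z) / ((2.0 * p - 1.0) * dp * d * d)
    return psi, dpsi, z_d2psi

def vhat_and_slope(lam, y, p=3.0):
    """Estimated variance n*sum(psi^2) / (lam^2 * sum(psi')^2) and its
    derivative in lam, for deviations y_i = x_i - median (no g_n term)."""
    n = len(y)
    s2 = s3 = s4 = s5 = 0.0
    for yi in y:
        z = lam * yi
        psi, dpsi, z_d2psi = psi_p(z, p)
        s2 += psi * psi
        s3 += dpsi
        s4 += z_d2psi
        s5 += z * psi * dpsi
    vhat = n * s2 / (lam * lam * s3 * s3)
    slope = 2.0 * n / (lam ** 3 * s3 * s3) * (s5 - s2 - s4 * s2 / s3)
    return vhat, slope

y = [-2.1, -0.8, -0.3, 0.1, 0.4, 1.2, 3.5]   # made-up deviations from a median
lam, h = 0.4, 1e-6
numeric = (vhat_and_slope(lam + h, y)[0] - vhat_and_slope(lam - h, y)[0]) / (2 * h)
print(vhat_and_slope(lam, y)[1], numeric)    # the two slopes should agree closely
```

The adaptive scale factor is the root of this slope (plus $g_n$), which ESTIM brackets and then refines by bisection.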

APPENDIX

B. Technical Details of the Monte Carlo Study

In Section 10 we present Monte Carlo results comparing various
forms of the adaptive M-estimator with the nonadaptive bisquare. In
this appendix we give details of how these results were obtained.
Most of the numbers presented in the text are relative efficiencies.
By using two well-known variance reduction techniques, the standard
errors of these numbers have been greatly reduced from what they
otherwise would be. The first technique is the Monte Carlo swindle
for location estimation, and the second is the use of common streams
of pseudo random numbers for comparison of correlated estimators;
see, e.g., Kleijnen (1975).

The relative efficiency of the estimator $T_2$ to $T_1$ is the
ratio $E(T_1 - \theta)^2 / E(T_2 - \theta)^2$. We estimate this by the ratio of estimated
expected squared errors. The expected squared error of $T_1$ may
be estimated naively by $\tfrac{1}{N}\sum_{i=1}^{N}[T_1(\mathbf{x}_i) - \theta]^2$, the mean of squared
errors for the N pseudo random samples. A great reduction from the
variance of this estimate is made possible by the Monte Carlo swindle
used in Andrews et al. (1972; Section 4B) and explained in detail by
Simon (1976). The simplest example of the swindle is for normal
samples. Since $\bar{X}$ is the Pitman estimator and $T_1$ is equivariant,
$E[T_1(X) - \theta]^2 = E[T_1(X) - \bar{X}]^2 + E(\bar{X} - \theta)^2 = E[T_1(X) - \bar{X}]^2 + 1/n$. Because
the variance of $[T_1(X) - \bar{X}]^2$ is usually many times smaller than that
of $[T_1(X) - \theta]^2$, the swindle is very useful. The general swindle
reduces the variance whenever X can be factored into a normal divided
by an independent random variable. When $T_1$ was the bisquare and
n = 20, the swindle reduced the variance of the estimate of
$E[T_1(X) - \theta]^2$ by factors of approximately 25, 32, and 2.3 for the
normal, 1WN and slash respectively. Since this evaluation of the
swindle requires estimation of $E[T_1(X) - \theta]^4$, we receive a free estimate
of the kurtosis of the bisquare. For the normal and 1WN the
estimated kurtosis is not significantly different from zero. For the
slash, however, the estimated kurtosis of the bisquare is 0.8.
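
The normal-sample swindle can be sketched in a few lines of Python. This is an illustration only: it uses the sample median as the equivariant estimator $T_1$ (not the bisquare used in the study), takes $\theta = 0$ and unit variance, and the function name `swindle_terms` is hypothetical.

```python
import random
from statistics import mean, median, variance

def swindle_terms(n, n_samples, estimator, seed=0):
    """Per-sample squared errors for standard normal samples (theta = 0):
    naive (T - theta)^2 and swindle (T - xbar)^2 + 1/n."""
    rng = random.Random(seed)
    naive, swindled = [], []
    for _ in range(n_samples):
        x = [rng.gauss(0.0, 1.0) for _ in range(n)]
        t = estimator(x)
        naive.append(t * t)
        swindled.append((t - mean(x)) ** 2 + 1.0 / n)
    return naive, swindled

naive, swindled = swindle_terms(n=20, n_samples=2000, estimator=median)
print(mean(naive), mean(swindled))            # two estimates of E[T - theta]^2
print(variance(naive) / variance(swindled))   # variance reduction factor
```

Both averages estimate the same quantity, but the swindle terms vary much less from sample to sample, which is the source of the reduction factors quoted above.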

The second variance reduction technique uses the fact that
estimates of $E[T_1(X) - \theta]^2$ and $E[T_2(X) - \theta]^2$ derived from the same
sequence of samples are likely to be highly correlated. This reduces
the variance of their ratio, the estimated relative efficiency of $T_2$
to $T_1$, in comparison with the ratio based on different samples for $T_1$
and $T_2$. The results in Table 1 of Section 10 are based on 10,000,
20,000, and 100,000 samples from the normal, 1WN and slash
respectively. In order to obtain an estimate of the standard error
of the estimated relative efficiency, the entire Monte Carlo was
divided into 100 equal parts. For each sub-Monte Carlo of 100, 200,
or 1000 samples an estimated relative efficiency was calculated.
Since the estimated relative efficiency of $T_2$ to $T_1$ is approximately
the mean of the 100 separate relative efficiencies, 1/10 times the
standard deviation of these relative efficiencies provides an estimate
of the standard error of the overall estimated relative
efficiency.
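
The batching scheme for the standard error can be written out as follows. This Python sketch is an illustration, assuming that the per-sample squared errors of the two estimators have already been computed on the same samples; the function name and the toy inputs are hypothetical.

```python
from statistics import stdev

def batched_relative_efficiency(sq_err_1, sq_err_2, n_batches=100):
    """Relative efficiency of T2 to T1 (mean squared error of T1 over that
    of T2) with a standard error from equal-sized batches of samples."""
    size = len(sq_err_1) // n_batches
    used = n_batches * size
    ratios = []
    for b in range(n_batches):
        lo, hi = b * size, (b + 1) * size
        ratios.append(sum(sq_err_1[lo:hi]) / sum(sq_err_2[lo:hi]))
    overall = sum(sq_err_1[:used]) / sum(sq_err_2[:used])
    # standard deviation of the batch ratios divided by sqrt(number of batches)
    std_error = stdev(ratios) / n_batches ** 0.5
    return overall, std_error

# Toy input: two batches of two samples each, with batch ratios 1 and 3.
print(batched_relative_efficiency([1.0, 1.0, 3.0, 3.0], [1.0, 1.0, 1.0, 1.0], n_batches=2))
```

With 100 batches the divisor is 10, which is the factor of 1/10 used in the text.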

Achieving the precision of the estimated relative efficiencies
in Table 1 required a sizeable Monte Carlo study. A total of 12 hours
on a PDP11/34A was required to estimate the triefficiency of $\hat{\theta}^*_{(1)}$. In
order to reduce the computations necessary to obtain comparable precision
of the triefficiencies for the ten variations of $\hat{\theta}^*_{(1)}$ appearing
in Tables 2-4, we employed an additional variance reduction technique.

Suppose that $T_3$ is a small modification of $T_2$. Then the performance
of $T_3$ relative to $T_1$ will be correlated with the performance
of $T_2$ relative to $T_1$. By reusing a fraction of the 100 sets of
samples, we may take advantage of this correlation to reduce the variance
of the estimate. In the following derivation, m = 20, M = 100,
and $(X_i, Y_i)$ are the relative efficiencies of $T_2$ to $T_1$ and $T_3$ to $T_1$
from the i-th set of samples. The number of samples in each sub-Monte
Carlo is large enough to insure approximate normality of $X_i$ and
$Y_i$.

Suppose that we observe $(x_1,y_1),\ldots,(x_m,y_m)$ and $x_{m+1},\ldots,x_M$,
where

(B.1)    $\begin{pmatrix} X_i \\ Y_i \end{pmatrix} \sim N\!\left( \begin{pmatrix} \mu_x \\ \mu_y \end{pmatrix}, \begin{pmatrix} \sigma_x^2 & \rho\sigma_x\sigma_y \\ \rho\sigma_x\sigma_y & \sigma_y^2 \end{pmatrix} \right)$

are independent for i = 1,...,M. We wish to estimate $\mu_x$ and $\mu_y$.
While $\bar{x}_M = \tfrac{1}{M}\sum_{i=1}^{M} x_i$ is the optimal estimate of $\mu_x$, $\bar{y}_m = \tfrac{1}{m}\sum_{i=1}^{m} y_i$ is
not necessarily the best estimator of $\mu_y$.


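The conditional distribution of Y given X is

(B.2)    $Y \mid X = x \;\sim\; N\big(\alpha + \beta x,\ (1 - \rho^2)\sigma_y^2\big)$

where $\alpha = \mu_y - \rho\,\frac{\sigma_y}{\sigma_x}\,\mu_x$ and $\beta = \rho\,\frac{\sigma_y}{\sigma_x}$. Anderson (1957) showed that the
maximum likelihood estimate of $\mu_y$ is $\hat{\mu}_y = \hat{\alpha} + \hat{\beta}\,\bar{x}_M$ where $\hat{\alpha}$ and $\hat{\beta}$ are
the least squares estimators from a regression of y on x. Thus

(B.3)    $\hat{\mu}_y = \bar{y}_m - \hat{\beta}\,(\bar{x}_m - \bar{x}_M)$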
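
The estimator (B.3), $\hat{\mu}_y = \bar{y}_m - \hat{\beta}(\bar{x}_m - \bar{x}_M)$, is simple to compute. The Python sketch below is an illustration with made-up data, assuming that y is observed only for the first m of the M values of x; the function name is hypothetical.

```python
from statistics import mean

def regression_adjusted_mean(x_all, y_sub):
    """Estimate mu_y as ybar_m - beta_hat * (xbar_m - xbar_M), where y is
    observed only for the first m = len(y_sub) entries of x_all."""
    m = len(y_sub)
    x_sub = x_all[:m]
    xbar_m, ybar_m, xbar_big = mean(x_sub), mean(y_sub), mean(x_all)
    sxx = sum((xi - xbar_m) ** 2 for xi in x_sub)
    sxy = sum((xi - xbar_m) * (yi - ybar_m) for xi, yi in zip(x_sub, y_sub))
    beta_hat = sxy / sxx            # least squares slope of y on x
    return ybar_m - beta_hat * (xbar_m - xbar_big)

x_all = [1.0, 2.0, 4.0, 5.0, 8.0, 10.0]
y_sub = [2.0 * xi + 1.0 for xi in x_all[:3]]    # exactly linear toy data
print(regression_adjusted_mean(x_all, y_sub))   # close to 2 * mean(x_all) + 1 = 11
```

When y is an exact linear function of x, the adjustment recovers the mean that would have been obtained from all M sets, which is the limiting case $\rho^2 = 1$.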


and

(B.4)    $\mathrm{Var}(\hat{\mu}_y) = \mathrm{Var}_x\, E(\hat{\mu}_y \mid x) + E_x\, \mathrm{Var}(\hat{\mu}_y \mid x)$

         $= \mathrm{Var}\!\left(\mu_y - \rho\,\frac{\sigma_y}{\sigma_x}\,\mu_x + \rho\,\frac{\sigma_y}{\sigma_x}\,\bar{x}_M\right) + E\left\{(1-\rho^2)\,\sigma_y^2 \left[\frac{1}{m} + \frac{(\bar{x}_m - \bar{x}_M)^2}{\sum_{i=1}^{m}(x_i - \bar{x}_m)^2}\right]\right\}$

         $= \rho^2\,\frac{\sigma_y^2}{M} + (1-\rho^2)\,\frac{\sigma_y^2}{m} + (1-\rho^2)\,\sigma_y^2\, E\!\left[\frac{(\bar{x}_m - \bar{x}_M)^2}{\sum_{i=1}^{m}(x_i - \bar{x}_m)^2}\right]$.

Suppose that $\{x_1,\ldots,x_m\}$ is a random subset of $\{x_1,\ldots,x_M\}$;
then for large M, $\bar{x}_M \approx \mu_x$ and $(\bar{x}_m - \bar{x}_M)^2 / \sum_{i=1}^{m}(x_i - \bar{x}_m)^2$ is approximately
1/m(m - 1) times a random variable distributed as F with 1 and (m - 1)
degrees of freedom. However, $\{x_1,\ldots,x_m\}$ need not be a random subset
of $\{x_1,\ldots,x_M\}$. By the nature of the Monte Carlo procedure we may
generate $\{x_1,\ldots,x_M\}$ and then $Y_i \mid x_i$ for any values of i that we
choose. It makes sense then to choose $\{x_1,\ldots,x_m\}$ to minimize
$(\bar{x}_m - \bar{x}_M)^2 / \sum_{i=1}^{m}(x_i - \bar{x}_m)^2$. In practice, for moderately large m and M,
the last term can be made negligible. Thus we have

(B.5)    $\mathrm{Var}(\hat{\mu}_y) \approx \rho^2\,\frac{\sigma_y^2}{M} + (1-\rho^2)\,\frac{\sigma_y^2}{m}$,

so that $\mathrm{Var}(\hat{\mu}_y)$ is virtually a convex combination of the variance
from using m and the variance from using M observations. The larger
$\rho^2$ is, the greater the efficiency of the procedure.

2 

The estimates of $\rho^2$ for the estimators in Tables 2-4 range
from 0 to 0.99. Two estimators ($c_{20}$ = 0.0 and the adaptive $\psi_{bs}$) are
practically uncorrelated with the basic adaptive estimator. In all
other cases except two ($\psi_{1.5}$ on the 1WN and $c_{20}$ = 1.3 on the 1WN) the
estimate of $\rho^2$ is at least 0.50, and in over half of these it exceeds
0.85. When $\rho^2$ = 0.85, the variance from reusing 20 sets of samples
(such that $\bar{x}_m - \bar{x}_M \approx 0$) is the same as that from 62.5 new sets of
samples. Thus 62.5 percent of the information is available from just
20 percent of the samples.
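
The figure of 62.5 can be reproduced from the approximation (B.5), $\mathrm{Var}(\hat{\mu}_y) \approx \rho^2\sigma_y^2/M + (1-\rho^2)\sigma_y^2/m$, by asking how many fresh sets k would give the variance $\sigma_y^2/k$. A small Python illustration (the function name is hypothetical):

```python
def equivalent_new_sets(rho_sq, m, M):
    """Number k of fresh sample sets whose variance sigma_y^2 / k matches
    rho^2 * sigma_y^2 / M + (1 - rho^2) * sigma_y^2 / m."""
    return 1.0 / (rho_sq / M + (1.0 - rho_sq) / m)

print(equivalent_new_sets(0.85, 20, 100))   # about 62.5, as quoted in the text
print(equivalent_new_sets(0.0, 20, 100))    # about 20: no gain when uncorrelated
print(equivalent_new_sets(1.0, 20, 100))    # about 100: as good as all M sets
```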

There is another important benefit of reusing samples. The
main purpose of Tables 2-4 is to demonstrate the effects of various
parameters on the performance of $\hat{\theta}^*_{(1)}$. Because of the high positive
correlations among entries in the same columns (even across tables),
the standard error of the difference between two estimated relative
efficiencies is usually smaller than either standard error. Of
course, this phenomenon is most pronounced for very similar
estimators.